• Survey Paper
  • Open access
  • Published: 01 July 2020

Cybersecurity data science: an overview from machine learning perspective

  • Iqbal H. Sarker   ORCID: orcid.org/0000-0003-1740-5517 1 , 2 ,
  • A. S. M. Kayes 3 ,
  • Shahriar Badsha 4 ,
  • Hamed Alqahtani 5 ,
  • Paul Watters 3 &
  • Alex Ng 3  

Journal of Big Data volume  7 , Article number:  41 ( 2020 ) Cite this article

144k Accesses

242 Citations

51 Altmetric

Metrics details

In a computing context, cybersecurity is undergoing massive shifts in technology and its operations in recent days, and data science is driving the change. Extracting security incident patterns or insights from cybersecurity data and building corresponding data-driven model , is the key to make a security system automated and intelligent. To understand and analyze the actual phenomena with data, various scientific methods, machine learning techniques, processes, and systems are used, which is commonly known as data science. In this paper, we focus and briefly discuss on cybersecurity data science , where the data is being gathered from relevant cybersecurity sources, and the analytics complement the latest data-driven patterns for providing more effective security solutions. The concept of cybersecurity data science allows making the computing process more actionable and intelligent as compared to traditional ones in the domain of cybersecurity. We then discuss and summarize a number of associated research issues and future directions . Furthermore, we provide a machine learning based multi-layered framework for the purpose of cybersecurity modeling. Overall, our goal is not only to discuss cybersecurity data science and relevant methods but also to focus the applicability towards data-driven intelligent decision making for protecting the systems from cyber-attacks.

Introduction

Due to the increasing dependency on digitalization and Internet-of-Things (IoT) [ 1 ], various security incidents such as unauthorized access [ 2 ], malware attack [ 3 ], zero-day attack [ 4 ], data breach [ 5 ], denial of service (DoS) [ 2 ], social engineering or phishing [ 6 ] etc. have grown at an exponential rate in recent years. For instance, in 2010, there were less than 50 million unique malware executables known to the security community. By 2012, they were double around 100 million, and in 2019, there are more than 900 million malicious executables known to the security community, and this number is likely to grow, according to the statistics of AV-TEST institute in Germany [ 7 ]. Cybercrime and attacks can cause devastating financial losses and affect organizations and individuals as well. It’s estimated that, a data breach costs 8.19 million USD for the United States and 3.9 million USD on an average [ 8 ], and the annual cost to the global economy from cybercrime is 400 billion USD [ 9 ]. According to Juniper Research [ 10 ], the number of records breached each year to nearly triple over the next 5 years. Thus, it’s essential that organizations need to adopt and implement a strong cybersecurity approach to mitigate the loss. According to [ 11 ], the national security of a country depends on the business, government, and individual citizens having access to applications and tools which are highly secure, and the capability on detecting and eliminating such cyber-threats in a timely way. Therefore, to effectively identify various cyber incidents either previously seen or unseen, and intelligently protect the relevant systems from such cyber-attacks, is a key issue to be solved urgently.

figure 1

Popularity trends of data science, machine learning and cybersecurity over time, where x-axis represents the timestamp information and y axis represents the corresponding popularity values

Cybersecurity is a set of technologies and processes designed to protect computers, networks, programs and data from attack, damage, or unauthorized access [ 12 ]. In recent days, cybersecurity is undergoing massive shifts in technology and its operations in the context of computing, and data science (DS) is driving the change, where machine learning (ML), a core part of “Artificial Intelligence” (AI) can play a vital role to discover the insights from data. Machine learning can significantly change the cybersecurity landscape and data science is leading a new scientific paradigm [ 13 , 14 ]. The popularity of these related technologies is increasing day-by-day, which is shown in Fig.  1 , based on the data of the last five years collected from Google Trends [ 15 ]. The figure represents timestamp information in terms of a particular date in the x-axis and corresponding popularity in the range of 0 (minimum) to 100 (maximum) in the y-axis. As shown in Fig.  1 , the popularity indication values of these areas are less than 30 in 2014, while they exceed 70 in 2019, i.e., more than double in terms of increased popularity. In this paper, we focus on cybersecurity data science (CDS), which is broadly related to these areas in terms of security data processing techniques and intelligent decision making in real-world applications. Overall, CDS is security data-focused, applies machine learning methods to quantify cyber risks, and ultimately seeks to optimize cybersecurity operations. Thus, the purpose of this paper is for those academia and industry people who want to study and develop a data-driven smart cybersecurity model based on machine learning techniques. Therefore, great emphasis is placed on a thorough description of various types of machine learning methods, and their relations and usage in the context of cybersecurity. This paper does not describe all of the different techniques used in cybersecurity in detail; instead, it gives an overview of cybersecurity data science modeling based on artificial intelligence, particularly from machine learning perspective.

The ultimate goal of cybersecurity data science is data-driven intelligent decision making from security data for smart cybersecurity solutions. CDS represents a partial paradigm shift from traditional well-known security solutions such as firewalls, user authentication and access control, cryptography systems etc. that might not be effective according to today’s need in cyber industry [ 16 , 17 , 18 , 19 ]. The problems are these are typically handled statically by a few experienced security analysts, where data management is done in an ad-hoc manner [ 20 , 21 ]. However, as an increasing number of cybersecurity incidents in different formats mentioned above continuously appear over time, such conventional solutions have encountered limitations in mitigating such cyber risks. As a result, numerous advanced attacks are created and spread very quickly throughout the Internet. Although several researchers use various data analysis and learning techniques to build cybersecurity models that are summarized in “ Machine learning tasks in cybersecurity ” section, a comprehensive security model based on the effective discovery of security insights and latest security patterns could be more useful. To address this issue, we need to develop more flexible and efficient security mechanisms that can respond to threats and to update security policies to mitigate them intelligently in a timely manner. To achieve this goal, it is inherently required to analyze a massive amount of relevant cybersecurity data generated from various sources such as network and system sources, and to discover insights or proper security policies with minimal human intervention in an automated manner.

Analyzing cybersecurity data and building the right tools and processes to successfully protect against cybersecurity incidents goes beyond a simple set of functional requirements and knowledge about risks, threats or vulnerabilities. For effectively extracting the insights or the patterns of security incidents, several machine learning techniques, such as feature engineering, data clustering, classification, and association analysis, or neural network-based deep learning techniques can be used, which are briefly discussed in “ Machine learning tasks in cybersecurity ” section. These learning techniques are capable to find the anomalies or malicious behavior and data-driven patterns of associated security incidents to make an intelligent decision. Thus, based on the concept of data-driven decision making, we aim to focus on cybersecurity data science , where the data is being gathered from relevant cybersecurity sources such as network activity, database activity, application activity, or user activity, and the analytics complement the latest data-driven patterns for providing corresponding security solutions.

The contributions of this paper are summarized as follows.

We first make a brief discussion on the concept of cybersecurity data science and relevant methods to understand its applicability towards data-driven intelligent decision making in the domain of cybersecurity. For this purpose, we also make a review and brief discussion on different machine learning tasks in cybersecurity, and summarize various cybersecurity datasets highlighting their usage in different data-driven cyber applications.

We then discuss and summarize a number of associated research issues and future directions in the area of cybersecurity data science, that could help both the academia and industry people to further research and development in relevant application areas.

Finally, we provide a generic multi-layered framework of the cybersecurity data science model based on machine learning techniques. In this framework, we briefly discuss how the cybersecurity data science model can be used to discover useful insights from security data and making data-driven intelligent decisions to build smart cybersecurity systems.

The remainder of the paper is organized as follows. “ Background ” section summarizes background of our study and gives an overview of the related technologies of cybersecurity data science. “ Cybersecurity data science ” section defines and discusses briefly about cybersecurity data science including various categories of cyber incidents data. In “  Machine learning tasks in cybersecurity ” section, we briefly discuss various categories of machine learning techniques including their relations with cybersecurity tasks and summarize a number of machine learning based cybersecurity models in the field. “ Research issues and future directions ” section briefly discusses and highlights various research issues and future directions in the area of cybersecurity data science. In “  A multi-layered framework for smart cybersecurity services ” section, we suggest a machine learning-based framework to build cybersecurity data science model and discuss various layers with their roles. In “  Discussion ” section, we highlight several key points regarding our studies. Finally,  “ Conclusion ” section concludes this paper.

In this section, we give an overview of the related technologies of cybersecurity data science including various types of cybersecurity incidents and defense strategies.

  • Cybersecurity

Over the last half-century, the information and communication technology (ICT) industry has evolved greatly, which is ubiquitous and closely integrated with our modern society. Thus, protecting ICT systems and applications from cyber-attacks has been greatly concerned by the security policymakers in recent days [ 22 ]. The act of protecting ICT systems from various cyber-threats or attacks has come to be known as cybersecurity [ 9 ]. Several aspects are associated with cybersecurity: measures to protect information and communication technology; the raw data and information it contains and their processing and transmitting; associated virtual and physical elements of the systems; the degree of protection resulting from the application of those measures; and eventually the associated field of professional endeavor [ 23 ]. Craigen et al. defined “cybersecurity as a set of tools, practices, and guidelines that can be used to protect computer networks, software programs, and data from attack, damage, or unauthorized access” [ 24 ]. According to Aftergood et al. [ 12 ], “cybersecurity is a set of technologies and processes designed to protect computers, networks, programs and data from attacks and unauthorized access, alteration, or destruction”. Overall, cybersecurity concerns with the understanding of diverse cyber-attacks and devising corresponding defense strategies that preserve several properties defined as below [ 25 , 26 ].

Confidentiality is a property used to prevent the access and disclosure of information to unauthorized individuals, entities or systems.

Integrity is a property used to prevent any modification or destruction of information in an unauthorized manner.

Availability is a property used to ensure timely and reliable access of information assets and systems to an authorized entity.

The term cybersecurity applies in a variety of contexts, from business to mobile computing, and can be divided into several common categories. These are - network security that mainly focuses on securing a computer network from cyber attackers or intruders; application security that takes into account keeping the software and the devices free of risks or cyber-threats; information security that mainly considers security and the privacy of relevant data; operational security that includes the processes of handling and protecting data assets. Typical cybersecurity systems are composed of network security systems and computer security systems containing a firewall, antivirus software, or an intrusion detection system [ 27 ].

Cyberattacks and security risks

The risks typically associated with any attack, which considers three security factors, such as threats, i.e., who is attacking, vulnerabilities, i.e., the weaknesses they are attacking, and impacts, i.e., what the attack does [ 9 ]. A security incident is an act that threatens the confidentiality, integrity, or availability of information assets and systems. Several types of cybersecurity incidents that may result in security risks on an organization’s systems and networks or an individual [ 2 ]. These are:

Unauthorized access that describes the act of accessing information to network, systems or data without authorization that results in a violation of a security policy [ 2 ];

Malware known as malicious software, is any program or software that intentionally designed to cause damage to a computer, client, server, or computer network, e.g., botnets. Examples of different types of malware including computer viruses, worms, Trojan horses, adware, ransomware, spyware, malicious bots, etc. [ 3 , 26 ]; Ransom malware, or ransomware , is an emerging form of malware that prevents users from accessing their systems or personal files, or the devices, then demands an anonymous online payment in order to restore access.

Denial-of-Service is an attack meant to shut down a machine or network, making it inaccessible to its intended users by flooding the target with traffic that triggers a crash. The Denial-of-Service (DoS) attack typically uses one computer with an Internet connection, while distributed denial-of-service (DDoS) attack uses multiple computers and Internet connections to flood the targeted resource [ 2 ];

Phishing a type of social engineering , used for a broad range of malicious activities accomplished through human interactions, in which the fraudulent attempt takes part to obtain sensitive information such as banking and credit card details, login credentials, or personally identifiable information by disguising oneself as a trusted individual or entity via an electronic communication such as email, text, or instant message, etc. [ 26 ];

Zero-day attack is considered as the term that is used to describe the threat of an unknown security vulnerability for which either the patch has not been released or the application developers were unaware [ 4 , 28 ].

Beside these attacks mentioned above, privilege escalation [ 29 ], password attack [ 30 ], insider threat [ 31 ], man-in-the-middle [ 32 ], advanced persistent threat [ 33 ], SQL injection attack [ 34 ], cryptojacking attack [ 35 ], web application attack [ 30 ] etc. are well-known as security incidents in the field of cybersecurity. A data breach is another type of security incident, known as a data leak, which is involved in the unauthorized access of data by an individual, application, or service [ 5 ]. Thus, all data breaches are considered as security incidents, however, all the security incidents are not data breaches. Most data breaches occur in the banking industry involving the credit card numbers, personal information, followed by the healthcare sector and the public sector [ 36 ].

Cybersecurity defense strategies

Defense strategies are needed to protect data or information, information systems, and networks from cyber-attacks or intrusions. More granularly, they are responsible for preventing data breaches or security incidents and monitoring and reacting to intrusions, which can be defined as any kind of unauthorized activity that causes damage to an information system [ 37 ]. An intrusion detection system (IDS) is typically represented as “a device or software application that monitors a computer network or systems for malicious activity or policy violations” [ 38 ]. The traditional well-known security solutions such as anti-virus, firewalls, user authentication, access control, data encryption and cryptography systems, however might not be effective according to today’s need in the cyber industry

[ 16 , 17 , 18 , 19 ]. On the other hand, IDS resolves the issues by analyzing security data from several key points in a computer network or system [ 39 , 40 ]. Moreover, intrusion detection systems can be used to detect both internal and external attacks.

Intrusion detection systems are different categories according to the usage scope. For instance, a host-based intrusion detection system (HIDS), and network intrusion detection system (NIDS) are the most common types based on the scope of single computers to large networks. In a HIDS, the system monitors important files on an individual system, while it analyzes and monitors network connections for suspicious traffic in a NIDS. Similarly, based on methodologies, the signature-based IDS, and anomaly-based IDS are the most well-known variants [ 37 ].

Signature-based IDS : A signature can be a predefined string, pattern, or rule that corresponds to a known attack. A particular pattern is identified as the detection of corresponding attacks in a signature-based IDS. An example of a signature can be known patterns or a byte sequence in a network traffic, or sequences used by malware. To detect the attacks, anti-virus software uses such types of sequences or patterns as a signature while performing the matching operation. Signature-based IDS is also known as knowledge-based or misuse detection [ 41 ]. This technique can be efficient to process a high volume of network traffic, however, is strictly limited to the known attacks only. Thus, detecting new attacks or unseen attacks is one of the biggest challenges faced by this signature-based system.

Anomaly-based IDS : The concept of anomaly-based detection overcomes the issues of signature-based IDS discussed above. In an anomaly-based intrusion detection system, the behavior of the network is first examined to find dynamic patterns, to automatically create a data-driven model, to profile the normal behavior, and thus it detects deviations in the case of any anomalies [ 41 ]. Thus, anomaly-based IDS can be treated as a dynamic approach, which follows behavior-oriented detection. The main advantage of anomaly-based IDS is the ability to identify unknown or zero-day attacks [ 42 ]. However, the issue is that the identified anomaly or abnormal behavior is not always an indicator of intrusions. It sometimes may happen because of several factors such as policy changes or offering a new service.

In addition, a hybrid detection approach [ 43 , 44 ] that takes into account both the misuse and anomaly-based techniques discussed above can be used to detect intrusions. In a hybrid system, the misuse detection system is used for detecting known types of intrusions and anomaly detection system is used for novel attacks [ 45 ]. Beside these approaches, stateful protocol analysis can also be used to detect intrusions that identifies deviations of protocol state similarly to the anomaly-based method, however it uses predetermined universal profiles based on accepted definitions of benign activity [ 41 ]. In Table 1 , we have summarized these common approaches highlighting their pros and cons. Once the detecting has been completed, the intrusion prevention system (IPS) that is intended to prevent malicious events, can be used to mitigate the risks in different ways such as manual, providing notification, or automatic process [ 46 ]. Among these approaches, an automatic response system could be more effective as it does not involve a human interface between the detection and response systems.

  • Data science

We are living in the age of data, advanced analytics, and data science, which are related to data-driven intelligent decision making. Although, the process of searching patterns or discovering hidden and interesting knowledge from data is known as data mining [ 47 ], in this paper, we use the broader term “data science” rather than data mining. The reason is that, data science, in its most fundamental form, is all about understanding of data. It involves studying, processing, and extracting valuable insights from a set of information. In addition to data mining, data analytics is also related to data science. The development of data mining, knowledge discovery, and machine learning that refers creating algorithms and program which learn on their own, together with the original data analysis and descriptive analytics from the statistical perspective, forms the general concept of “data analytics” [ 47 ]. Nowadays, many researchers use the term “data science” to describe the interdisciplinary field of data collection, preprocessing, inferring, or making decisions by analyzing the data. To understand and analyze the actual phenomena with data, various scientific methods, machine learning techniques, processes, and systems are used, which is commonly known as data science. According to Cao et al. [ 47 ] “data science is a new interdisciplinary field that synthesizes and builds on statistics, informatics, computing, communication, management, and sociology to study data and its environments, to transform data to insights and decisions by following a data-to-knowledge-to-wisdom thinking and methodology”. As a high-level statement in the context of cybersecurity, we can conclude that it is the study of security data to provide data-driven solutions for the given security problems, as known as “the science of cybersecurity data”. Figure 2 shows the typical data-to-insight-to-decision transfer at different periods and general analytic stages in data science, in terms of a variety of analytics goals (G) and approaches (A) to achieve the data-to-decision goal [ 47 ].

figure 2

Data-to-insight-to-decision analytic stages in data science [ 47 ]

Based on the analytic power of data science including machine learning techniques, it can be a viable component of security strategies. By using data science techniques, security analysts can manipulate and analyze security data more effectively and efficiently, uncovering valuable insights from data. Thus, data science methodologies including machine learning techniques can be well utilized in the context of cybersecurity, in terms of problem understanding, gathering security data from diverse sources, preparing data to feed into the model, data-driven model building and updating, for providing smart security services, which motivates to define cybersecurity data science and to work in this research area.

Cybersecurity data science

In this section, we briefly discuss cybersecurity data science including various categories of cyber incidents data with the usage in different application areas, and the key terms and areas related to our study.

Understanding cybersecurity data

Data science is largely driven by the availability of data [ 48 ]. Datasets typically represent a collection of information records that consist of several attributes or features and related facts, in which cybersecurity data science is based on. Thus, it’s important to understand the nature of cybersecurity data containing various types of cyberattacks and relevant features. The reason is that raw security data collected from relevant cyber sources can be used to analyze the various patterns of security incidents or malicious behavior, to build a data-driven security model to achieve our goal. Several datasets exist in the area of cybersecurity including intrusion analysis, malware analysis, anomaly, fraud, or spam analysis that are used for various purposes. In Table 2 , we summarize several such datasets including their various features and attacks that are accessible on the Internet, and highlight their usage based on machine learning techniques in different cyber applications. Effectively analyzing and processing of these security features, building target machine learning-based security model according to the requirements, and eventually, data-driven decision making, could play a role to provide intelligent cybersecurity services that are discussed briefly in “ A multi-layered framework for smart cybersecurity services ” section.

Defining cybersecurity data science

Data science is transforming the world’s industries. It is critically important for the future of intelligent cybersecurity systems and services because of “security is all about data”. When we seek to detect cyber threats, we are analyzing the security data in the form of files, logs, network packets, or other relevant sources. Traditionally, security professionals didn’t use data science techniques to make detections based on these data sources. Instead, they used file hashes, custom-written rules like signatures, or manually defined heuristics [ 21 ]. Although these techniques have their own merits in several cases, it needs too much manual work to keep up with the changing cyber threat landscape. On the contrary, data science can make a massive shift in technology and its operations, where machine learning algorithms can be used to learn or extract insight of security incident patterns from the training data for their detection and prevention. For instance, to detect malware or suspicious trends, or to extract policy rules, these techniques can be used.

In recent days, the entire security industry is moving towards data science, because of its capability to transform raw data into decision making. To do this, several data-driven tasks can be associated, such as—(i) data engineering focusing practical applications of data gathering and analysis; (ii) reducing data volume that deals with filtering significant and relevant data to further analysis; (iii) discovery and detection that focuses on extracting insight or incident patterns or knowledge from data; (iv) automated models that focus on building data-driven intelligent security model; (v) targeted security  alerts focusing on the generation of remarkable security alerts based on discovered knowledge that minimizes the false alerts, and (vi) resource optimization that deals with the available resources to achieve the target goals in a security system. While making data-driven decisions, behavioral analysis could also play a significant role in the domain of cybersecurity [ 81 ].

Thus, the concept of cybersecurity data science incorporates the methods and techniques of data science and machine learning as well as the behavioral analytics of various security incidents. The combination of these technologies has given birth to the term “cybersecurity data science”, which refers to collect a large amount of security event data from different sources and analyze it using machine learning technologies for detecting security risks or attacks either through the discovery of useful insights or the latest data-driven patterns. It is, however, worth remembering that cybersecurity data science is not just about a collection of machine learning algorithms, rather,  a process that can help security professionals or analysts to scale and automate their security activities in a smart way and in a timely manner. Therefore, the formal definition can be as follows: “Cybersecurity data science is a research or working area existing at the intersection of cybersecurity, data science, and machine learning or artificial intelligence, which is mainly security data-focused, applies machine learning methods, attempts to quantify cyber-risks or incidents, and promotes inferential techniques to analyze behavioral patterns in security data. It also focuses on generating security response alerts, and eventually seeks for optimizing cybersecurity solutions, to build automated and intelligent cybersecurity systems.”

Table  3 highlights some key terms associated with cybersecurity data science. Overall, the outputs of cybersecurity data science are typically security data products, which can be a data-driven security model, policy rule discovery, risk or attack prediction, potential security service and recommendation, or the corresponding security system depending on the given security problem in the domain of cybersecurity. In the next section, we briefly discuss various machine learning tasks with examples within the scope of our study.

Machine learning tasks in cybersecurity

Machine learning (ML) is typically considered as a branch of “Artificial Intelligence”, which is closely related to computational statistics, data mining and analytics, data science, particularly focusing on making the computers to learn from data [ 82 , 83 ]. Thus, machine learning models typically comprise of a set of rules, methods, or complex “transfer functions” that can be applied to find interesting data patterns, or to recognize or predict behavior [ 84 ], which could play an important role in the area of cybersecurity. In the following, we discuss different methods that can be used to solve machine learning tasks and how they are related to cybersecurity tasks.

Supervised learning

Supervised learning is performed when specific targets are defined to reach from a certain set of inputs, i.e., task-driven approach. In the area of machine learning, the most popular supervised learning techniques are known as classification and regression methods [ 129 ]. These techniques are popular to classify or predict the future for a particular security problem. For instance, to predict denial-of-service attack (yes, no) or to identify different classes of network attacks such as scanning and spoofing, classification techniques can be used in the cybersecurity domain. ZeroR [ 83 ], OneR [ 130 ], Navies Bayes [ 131 ], Decision Tree [ 132 , 133 ], K-nearest neighbors [ 134 ], support vector machines [ 135 ], adaptive boosting [ 136 ], and logistic regression [ 137 ] are the well-known classification techniques. In addition, recently Sarker et al. have proposed BehavDT [ 133 ], and IntruDtree [ 106 ] classification techniques that are able to effectively build a data-driven predictive model. On the other hand, to predict the continuous or numeric value, e.g., total phishing attacks in a certain period or predicting the network packet parameters, regression techniques are useful. Regression analyses can also be used to detect the root causes of cybercrime and other types of fraud [ 138 ]. Linear regression [ 82 ], support vector regression [ 135 ] are the popular regression techniques. The main difference between classification and regression is that the output variable in the regression is numerical or continuous, while the predicted output for classification is categorical or discrete. Ensemble learning is an extension of supervised learning while mixing different simple models, e.g., Random Forest learning [ 139 ] that generates multiple decision trees to solve a particular security task.

Unsupervised learning

In unsupervised learning problems, the main task is to find patterns, structures, or knowledge in unlabeled data, i.e., data-driven approach [ 140 ]. In the area of cybersecurity, cyber-attacks like malware stays hidden in some ways, include changing their behavior dynamically and autonomously to avoid detection. Clustering techniques, a type of unsupervised learning, can help to uncover the hidden patterns and structures from the datasets, to identify indicators of such sophisticated attacks. Similarly, in identifying anomalies, policy violations, detecting, and eliminating noisy instances in data, clustering techniques can be useful. K-means [ 141 ], K-medoids [ 142 ] are the popular partitioning clustering algorithms, and single linkage [ 143 ] or complete linkage [ 144 ] are the well-known hierarchical clustering algorithms used in various application domains. Moreover, a bottom-up clustering approach proposed by Sarker et al. [ 145 ] can also be used by taking into account the data characteristics.

Besides, feature engineering tasks like optimal feature selection or extraction related to a particular security problem could be useful for further analysis [ 106 ]. Recently, Sarker et al. [ 106 ] have proposed an approach for selecting security features according to their importance score values. Moreover, Principal component analysis, linear discriminant analysis, pearson correlation analysis, or non-negative matrix factorization are the popular dimensionality reduction techniques to solve such issues [ 82 ]. Association rule learning is another example, where machine learning based policy rules can prevent cyber-attacks. In an expert system, the rules are usually manually defined by a knowledge engineer working in collaboration with a domain expert [ 37 , 140 , 146 ]. Association rule learning on the contrary, is the discovery of rules or relationships among a set of available security features or attributes in a given dataset [ 147 ]. To quantify the strength of relationships, correlation analysis can be used [ 138 ]. Many association rule mining algorithms have been proposed in the area of machine learning and data mining literature, such as logic-based [ 148 ], frequent pattern based [ 149 , 150 , 151 ], tree-based [ 152 ], etc. Recently, Sarker et al. [ 153 ] have proposed an association rule learning approach considering non-redundant generation, that can be used to discover a set of useful security policy rules. Moreover, AIS [ 147 ], Apriori [ 149 ], Apriori-TID and Apriori-Hybrid [ 149 ], FP-Tree [ 152 ], and RARM [ 154 ], and Eclat [ 155 ] are the well-known association rule learning algorithms that are capable to solve such problems by generating a set of policy rules in the domain of cybersecurity.

Neural networks and deep learning

Deep learning is a part of machine learning in the area of artificial intelligence, which is a computational model that is inspired by the biological neural networks in the human brain [ 82 ]. Artificial Neural Network (ANN) is frequently used in deep learning and the most popular neural network algorithm is backpropagation [ 82 ]. It performs learning on a multi-layer feed-forward neural network consists of an input layer, one or more hidden layers, and an output layer. The main difference between deep learning and classical machine learning is its performance on the amount of security data increases. Typically deep learning algorithms perform well when the data volumes are large, whereas machine learning algorithms perform comparatively better on small datasets [ 44 ]. In our earlier work, Sarker et al. [ 129 ], we have illustrated the effectiveness of these approaches considering contextual datasets. However, deep learning approaches mimic the human brain mechanism to interpret large amount of data or the complex data such as images, sounds and texts [ 44 , 129 ]. In terms of feature extraction to build models, deep learning reduces the effort of designing a feature extractor for each problem than the classical machine learning techniques. Beside these characteristics, deep learning typically takes a long time to train an algorithm than a machine learning algorithm, however, the test time is exactly the opposite [ 44 ]. Thus, deep learning relies more on high-performance machines with GPUs than classical machine-learning algorithms [ 44 , 156 ]. The most popular deep neural network learning models include multi-layer perceptron (MLP) [ 157 ], convolutional neural network (CNN) [ 158 ], recurrent neural network (RNN) or long-short term memory (LSTM) network [ 121 , 158 ]. In recent days, researchers use these deep learning techniques for different purposes such as detecting network intrusions, malware traffic detection and classification, etc. in the domain of cybersecurity [ 44 , 159 ].

Other learning techniques

Semi-supervised learning can be described as a hybridization of supervised and unsupervised techniques discussed above, as it works on both the labeled and unlabeled data. In the area of cybersecurity, it could be useful, when it requires to label data automatically without human intervention, to improve the performance of cybersecurity models. Reinforcement techniques are another type of machine learning that characterizes an agent by creating its own learning experiences through interacting directly with the environment, i.e., environment-driven approach, where the environment is typically formulated as a Markov decision process and take decision based on a reward function [ 160 ]. Monte Carlo learning, Q-learning, Deep Q Networks, are the most common reinforcement learning algorithms [ 161 ]. For instance, in a recent work [ 126 ], the authors present an approach for detecting botnet traffic or malicious cyber activities using reinforcement learning combining with neural network classifier. In another work [ 128 ], the authors discuss about the application of deep reinforcement learning to intrusion detection for supervised problems, where they received the best results for the Deep Q-Network algorithm. In the context of cybersecurity, genetic algorithms that use fitness, selection, crossover, and mutation for finding optimization, could also be used to solve a similar class of learning problems [ 119 ].

Various types of machine learning techniques discussed above can be useful in the domain of cybersecurity, to build an effective security model. In Table  4 , we have summarized several machine learning techniques that are used to build various types of security models for various purposes. Although these models typically represent a learning-based security model, in this paper, we aim to focus on a comprehensive cybersecurity data science model and relevant issues, in order to build a data-driven intelligent security system. In the next section, we highlight several research issues and potential solutions in the area of cybersecurity data science.

Research issues and future directions

Our study opens several research issues and challenges in the area of cybersecurity data science to extract insight from relevant data towards data-driven intelligent decision making for cybersecurity solutions. In the following, we summarize these challenges ranging from data collection to decision making.

Cybersecurity datasets : Source datasets are the primary component to work in the area of cybersecurity data science. Most of the existing datasets are old and might insufficient in terms of understanding the recent behavioral patterns of various cyber-attacks. Although the data can be transformed into a meaningful understanding level after performing several processing tasks, there is still a lack of understanding of the characteristics of recent attacks and their patterns of happening. Thus, further processing or machine learning algorithms may provide a low accuracy rate for making the target decisions. Therefore, establishing a large number of recent datasets for a particular problem domain like cyber risk prediction or intrusion detection is needed, which could be one of the major challenges in cybersecurity data science.

Handling quality problems in cybersecurity datasets : The cyber datasets might be noisy, incomplete, insignificant, imbalanced, or may contain inconsistency instances related to a particular security incident. Such problems in a data set may affect the quality of the learning process and degrade the performance of the machine learning-based models [ 162 ]. To make a data-driven intelligent decision for cybersecurity solutions, such problems in data is needed to deal effectively before building the cyber models. Therefore, understanding such problems in cyber data and effectively handling such problems using existing algorithms or newly proposed algorithm for a particular problem domain like malware analysis or intrusion detection and prevention is needed, which could be another research issue in cybersecurity data science.

Security policy rule generation : Security policy rules reference security zones and enable a user to allow, restrict, and track traffic on the network based on the corresponding user or user group, and service, or the application. The policy rules including the general and more specific rules are compared against the incoming traffic in sequence during the execution, and the rule that matches the traffic is applied. The policy rules used in most of the cybersecurity systems are static and generated by human expertise or ontology-based [ 163 , 164 ]. Although, association rule learning techniques produce rules from data, however, there is a problem of redundancy generation [ 153 ] that makes the policy rule-set complex. Therefore, understanding such problems in policy rule generation and effectively handling such problems using existing algorithms or newly proposed algorithm for a particular problem domain like access control [ 165 ] is needed, which could be another research issue in cybersecurity data science.

Hybrid learning method : Most commercial products in the cybersecurity domain contain signature-based intrusion detection techniques [ 41 ]. However, missing features or insufficient profiling can cause these techniques to miss unknown attacks. In that case, anomaly-based detection techniques or hybrid technique combining signature-based and anomaly-based can be used to overcome such issues. A hybrid technique combining multiple learning techniques or a combination of deep learning and machine-learning methods can be used to extract the target insight for a particular problem domain like intrusion detection, malware analysis, access control, etc. and make the intelligent decision for corresponding cybersecurity solutions.

Protecting the valuable security information : Another issue of a cyber data attack is the loss of extremely valuable data and information, which could be damaging for an organization. With the use of encryption or highly complex signatures, one can stop others from probing into a dataset. In such cases, cybersecurity data science can be used to build a data-driven impenetrable protocol to protect such security information. To achieve this goal, cyber analysts can develop algorithms by analyzing the history of cyberattacks to detect the most frequently targeted chunks of data. Thus, understanding such data protecting problems and designing corresponding algorithms to effectively handling these problems, could be another research issue in the area of cybersecurity data science.

Context-awareness in cybersecurity : Existing cybersecurity work mainly originates from the relevant cyber data containing several low-level features. When data mining and machine learning techniques are applied to such datasets, a related pattern can be identified that describes it properly. However, a broader contextual information [ 140 , 145 , 166 ] like temporal, spatial, relationship among events or connections, dependency can be used to decide whether there exists a suspicious activity or not. For instance, some approaches may consider individual connections as DoS attacks, while security experts might not treat them as malicious by themselves. Thus, a significant limitation of existing cybersecurity work is the lack of using the contextual information for predicting risks or attacks. Therefore, context-aware adaptive cybersecurity solutions could be another research issue in cybersecurity data science.

Feature engineering in cybersecurity : The efficiency and effectiveness of a machine learning-based security model has always been a major challenge due to the high volume of network data with a large number of traffic features. The large dimensionality of data has been addressed using several techniques such as principal component analysis (PCA) [ 167 ], singular value decomposition (SVD) [ 168 ] etc. In addition to low-level features in the datasets, the contextual relationships between suspicious activities might be relevant. Such contextual data can be stored in an ontology or taxonomy for further processing. Thus how to effectively select the optimal features or extract the significant features considering both the low-level features as well as the contextual features, for effective cybersecurity solutions could be another research issue in cybersecurity data science.

Remarkable security alert generation and prioritizing : In many cases, the cybersecurity system may not be well defined and may cause a substantial number of false alarms that are unexpected in an intelligent system. For instance, an IDS deployed in a real-world network generates around nine million alerts per day [ 169 ]. A network-based intrusion detection system typically looks at the incoming traffic for matching the associated patterns to detect risks, threats or vulnerabilities and generate security alerts. However, to respond to each such alert might not be effective as it consumes relatively huge amounts of time and resources, and consequently may result in a self-inflicted DoS. To overcome this problem, a high-level management is required that correlate the security alerts considering the current context and their logical relationship including their prioritization before reporting them to users, which could be another research issue in cybersecurity data science.

Recency analysis in cybersecurity solutions : Machine learning-based security models typically use a large amount of static data to generate data-driven decisions. Anomaly detection systems rely on constructing such a model considering normal behavior and anomaly, according to their patterns. However, normal behavior in a large and dynamic security system is not well defined and it may change over time, which can be considered as an incremental growing of dataset. The patterns in incremental datasets might be changed in several cases. This often results in a substantial number of false alarms known as false positives. Thus, a recent malicious behavioral pattern is more likely to be interesting and significant than older ones for predicting unknown attacks. Therefore, effectively using the concept of recency analysis [ 170 ] in cybersecurity solutions could be another issue in cybersecurity data science.

The most important work for an intelligent cybersecurity system is to develop an effective framework that supports data-driven decision making. In such a framework, we need to consider advanced data analysis based on machine learning techniques, so that the framework is capable to minimize these issues and to provide automated and intelligent security services. Thus, a well-designed security framework for cybersecurity data and the experimental evaluation is a very important direction and a big challenge as well. In the next section, we suggest and discuss a data-driven cybersecurity framework based on machine learning techniques considering multiple processing layers.

A multi-layered framework for smart cybersecurity services

As discussed earlier, cybersecurity data science is data-focused, applies machine learning methods, attempts to quantify cyber risks, promotes inferential techniques to analyze behavioral patterns, focuses on generating security response alerts, and eventually seeks for optimizing cybersecurity operations. Hence, we briefly discuss a multiple data processing layered framework that potentially can be used to discover security insights from the raw data to build smart cybersecurity systems, e.g., dynamic policy rule-based access control or intrusion detection and prevention system. To make a data-driven intelligent decision in the resultant cybersecurity system, understanding the security problems and the nature of corresponding security data and their vast analysis is needed. For this purpose, our suggested framework not only considers the machine learning techniques to build the security model but also takes into account the incremental learning and dynamism to keep the model up-to-date and corresponding response generation, which could be more effective and intelligent for providing the expected services. Figure 3 shows an overview of the framework, involving several processing layers, from raw security event data to services. In the following, we briefly discuss the working procedure of the framework.

figure 3

A generic multi-layered framework based on machine learning techniques for smart cybersecurity services

Security data collecting

Collecting valuable cybersecurity data is a crucial step, which forms a connecting link between security problems in cyberinfrastructure and corresponding data-driven solution steps in this framework, shown in Fig.  3 . The reason is that cyber data can serve as the source for setting up ground truth of the security model that affect the model performance. The quality and quantity of cyber data decide the feasibility and effectiveness of solving the security problem according to our goal. Thus, the concern is how to collect valuable and unique needs data for building the data-driven security models.

The general step to collect and manage security data from diverse data sources is based on a particular security problem and project within the enterprise. Data sources can be classified into several broad categories such as network, host, and hybrid [ 171 ]. Within the network infrastructure, the security system can leverage different types of security data such as IDS logs, firewall logs, network traffic data, packet data, and honeypot data, etc. for providing the target security services. For instance, a given IP is considered malicious or not, could be detected by performing data analysis utilizing the data of IP addresses and their cyber activities. In the domain of cybersecurity, the network source mentioned above is considered as the primary security event source to analyze. In the host category, it collects data from an organization’s host machines, where the data sources can be operating system logs, database access logs, web server logs, email logs, application logs, etc. Collecting data from both the network and host machines are considered a hybrid category. Overall, in a data collection layer the network activity, database activity, application activity, and user activity can be the possible security event sources in the context of cybersecurity data science.

Security data preparing

After collecting the raw security data from various sources according to the problem domain discussed above, this layer is responsible to prepare the raw data for building the model by applying various necessary processes. However, not all of the collected data contributes to the model building process in the domain of cybersecurity [ 172 ]. Therefore, the useless data should be removed from the rest of the data captured by the network sniffer. Moreover, data might be noisy, have missing or corrupted values, or have attributes of widely varying types and scales. High quality of data is necessary for achieving higher accuracy in a data-driven model, which is a process of learning a function that maps an input to an output based on example input-output pairs. Thus, it might require a procedure for data cleaning, handling missing or corrupted values. Moreover, security data features or attributes can be in different types, such as continuous, discrete, or symbolic [ 106 ]. Beyond a solid understanding of these types of data and attributes and their permissible operations, its need to preprocess the data and attributes to convert into the target type. Besides, the raw data can be in different types such as structured, semi-structured, or unstructured, etc. Thus, normalization, transformation, or collation can be useful to organize the data in a structured manner. In some cases, natural language processing techniques might be useful depending on data type and characteristics, e.g., textual contents. As both the quality and quantity of data decide the feasibility of solving the security problem, effectively pre-processing and management of data and their representation can play a significant role to build an effective security model for intelligent services.

Machine learning-based security modeling

This is the core step where insights and knowledge are extracted from data through the application of cybersecurity data science. In this section, we particularly focus on machine learning-based modeling as machine learning techniques can significantly change the cybersecurity landscape. The security features or attributes and their patterns in data are of high interest to be discovered and analyzed to extract security insights. To achieve the goal, a deeper understanding of data and machine learning-based analytical models utilizing a large number of cybersecurity data can be effective. Thus, various machine learning tasks can be involved in this model building layer according to the solution perspective. These are - security feature engineering that mainly responsible to transform raw security data into informative features that effectively represent the underlying security problem to the data-driven models. Thus, several data-processing tasks such as feature transformation and normalization, feature selection by taking into account a subset of available security features according to their correlations or importance in modeling, or feature generation and extraction by creating new brand principal components, may be involved in this module according to the security data characteristics. For instance, the chi-squared test, analysis of variance test, correlation coefficient analysis, feature importance, as well as discriminant and principal component analysis, or singular value decomposition, etc. can be used for analyzing the significance of the security features to perform the security feature engineering tasks [ 82 ].

Another significant module is security data clustering that uncovers hidden patterns and structures through huge volumes of security data, to identify where the new threats exist. It typically involves the grouping of security data with similar characteristics, which can be used to solve several cybersecurity problems such as detecting anomalies, policy violations, etc. Malicious behavior or anomaly detection module is typically responsible to identify a deviation to a known behavior, where clustering-based analysis and techniques can also be used to detect malicious behavior or anomaly detection. In the cybersecurity area, attack classification or prediction is treated as one of the most significant modules, which is responsible to build a prediction model to classify attacks or threats and to predict future for a particular security problem. To predict denial-of-service attack or a spam filter separating tasks from other messages, could be the relevant examples. Association learning or policy rule generation module can play a role to build an expert security system that comprises several IF-THEN rules that define attacks. Thus, in a problem of policy rule generation for rule-based access control system, association learning can be used as it discovers the associations or relationships among a set of available security features in a given security dataset. The popular machine learning algorithms in these categories are briefly discussed in “  Machine learning tasks in cybersecurity ” section. The module model selection or customization is responsible to choose whether it uses the existing machine learning model or needed to customize. Analyzing data and building models based on traditional machine learning or deep learning methods, could achieve acceptable results in certain cases in the domain of cybersecurity. However, in terms of effectiveness and efficiency or other performance measurements considering time complexity, generalization capacity, and most importantly the impact of the algorithm on the detection rate of a system, machine learning models are needed to customize for a specific security problem. Moreover, customizing the related techniques and data could improve the performance of the resultant security model and make it better applicable in a cybersecurity domain. The modules discussed above can work separately and combinedly depending on the target security problems.

Incremental learning and dynamism

In our framework, this layer is concerned with finalizing the resultant security model by incorporating additional intelligence according to the needs. This could be possible by further processing in several modules. For instance, the post-processing and improvement module in this layer could play a role to simplify the extracted knowledge according to the particular requirements by incorporating domain-specific knowledge. As the attack classification or prediction models based on machine learning techniques strongly rely on the training data, it can hardly be generalized to other datasets, which could be significant for some applications. To address such kind of limitations, this module is responsible to utilize the domain knowledge in the form of taxonomy or ontology to improve attack correlation in cybersecurity applications.

Another significant module recency mining and updating security model is responsible to keep the security model up-to-date for better performance by extracting the latest data-driven security patterns. The extracted knowledge discussed in the earlier layer is based on a static initial dataset considering the overall patterns in the datasets. However, such knowledge might not be guaranteed higher performance in several cases, because of incremental security data with recent patterns. In many cases, such incremental data may contain different patterns which could conflict with existing knowledge. Thus, the concept of RecencyMiner [ 170 ] on incremental security data and extracting new patterns can be more effective than the existing old patterns. The reason is that recent security patterns and rules are more likely to be significant than older ones for predicting cyber risks or attacks. Rather than processing the whole security data again, recency-based dynamic updating according to the new patterns would be more efficient in terms of processing and outcome. This could make the resultant cybersecurity model intelligent and dynamic. Finally, response planning and decision making module is responsible to make decisions based on the extracted insights and take necessary actions to prevent the system from the cyber-attacks to provide automated and intelligent services. The services might be different depending on particular requirements for a given security problem.

Overall, this framework is a generic description which potentially can be used to discover useful insights from security data, to build smart cybersecurity systems, to address complex security challenges, such as intrusion detection, access control management, detecting anomalies and fraud, or denial of service attacks, etc. in the area of cybersecurity data science.

Although several research efforts have been directed towards cybersecurity solutions, discussed in “ Background ” , “ Cybersecurity data science ”, and “ Machine learning tasks in cybersecurity ” sections in different directions, this paper presents a comprehensive view of cybersecurity data science. For this, we have conducted a literature review to understand cybersecurity data, various defense strategies including intrusion detection techniques, different types of machine learning techniques in cybersecurity tasks. Based on our discussion on existing work, several research issues related to security datasets, data quality problems, policy rule generation, learning methods, data protection, feature engineering, security alert generation, recency analysis etc. are identified that require further research attention in the domain of cybersecurity data science.

The scope of cybersecurity data science is broad. Several data-driven tasks such as intrusion detection and prevention, access control management, security policy generation, anomaly detection, spam filtering, fraud detection and prevention, various types of malware attack detection and defense strategies, etc. can be considered as the scope of cybersecurity data science. Such tasks based categorization could be helpful for security professionals including the researchers and practitioners who are interested in the domain-specific aspects of security systems [ 171 ]. The output of cybersecurity data science can be used in many application areas such as Internet of things (IoT) security [ 173 ], network security [ 174 ], cloud security [ 175 ], mobile and web applications [ 26 ], and other relevant cyber areas. Moreover, intelligent cybersecurity solutions are important for the banking industry, the healthcare sector, or the public sector, where data breaches typically occur [ 36 , 176 ]. Besides, the data-driven security solutions could also be effective in AI-based blockchain technology, where AI works with huge volumes of security event data to extract the useful insights using machine learning techniques, and block-chain as a trusted platform to store such data [ 177 ].

Although in this paper, we discuss cybersecurity data science focusing on examining raw security data to data-driven decision making for intelligent security solutions, it could also be related to big data analytics in terms of data processing and decision making. Big data deals with data sets that are too large or complex having characteristics of high data volume, velocity, and variety. Big data analytics mainly has two parts consisting of data management involving data storage, and analytics [ 178 ]. The analytics typically describe the process of analyzing such datasets to discover patterns, unknown correlations, rules, and other useful insights [ 179 ]. Thus, several advanced data analysis techniques such as AI, data mining, machine learning could play an important role in processing big data by converting big problems to small problems [ 180 ]. To do this, the potential strategies like parallelization, divide-and-conquer, incremental learning, sampling, granular computing, feature or instance selection, can be used to make better decisions, reducing costs, or enabling more efficient processing. In such cases, the concept of cybersecurity data science, particularly machine learning-based modeling could be helpful for process automation and decision making for intelligent security solutions. Moreover, researchers could consider modified algorithms or models for handing big data on parallel computing platforms like Hadoop, Storm, etc. [ 181 ].

Based on the concept of cybersecurity data science discussed in the paper, building a data-driven security model for a particular security problem and relevant empirical evaluation to measure the effectiveness and efficiency of the model, and to asses the usability in the real-world application domain could be a future work.

Motivated by the growing significance of cybersecurity and data science, and machine learning technologies, in this paper, we have discussed how cybersecurity data science applies to data-driven intelligent decision making in smart cybersecurity systems and services. We also have discussed how it can impact security data, both in terms of extracting insight of security incidents and the dataset itself. We aimed to work on cybersecurity data science by discussing the state of the art concerning security incidents data and corresponding security services. We also discussed how machine learning techniques can impact in the domain of cybersecurity, and examine the security challenges that remain. In terms of existing research, much focus has been provided on traditional security solutions, with less available work in machine learning technique based security systems. For each common technique, we have discussed relevant security research. The purpose of this article is to share an overview of the conceptualization, understanding, modeling, and thinking about cybersecurity data science.

We have further identified and discussed various key issues in security analysis to showcase the signpost of future research directions in the domain of cybersecurity data science. Based on the knowledge, we have also provided a generic multi-layered framework of cybersecurity data science model based on machine learning techniques, where the data is being gathered from diverse sources, and the analytics complement the latest data-driven patterns for providing intelligent security services. The framework consists of several main phases - security data collecting, data preparation, machine learning-based security modeling, and incremental learning and dynamism for smart cybersecurity systems and services. We specifically focused on extracting insights from security data, from setting a research design with particular attention to concepts for data-driven intelligent security solutions.

Overall, this paper aimed not only to discuss cybersecurity data science and relevant methods but also to discuss the applicability towards data-driven intelligent decision making in cybersecurity systems and services from machine learning perspectives. Our analysis and discussion can have several implications both for security researchers and practitioners. For researchers, we have highlighted several issues and directions for future research. Other areas for potential research include empirical evaluation of the suggested data-driven model, and comparative analysis with other security systems. For practitioners, the multi-layered machine learning-based model can be used as a reference in designing intelligent cybersecurity systems for organizations. We believe that our study on cybersecurity data science opens a promising path and can be used as a reference guide for both academia and industry for future research and applications in the area of cybersecurity.

Availability of data and materials

Not applicable.

Abbreviations

  • Machine learning

Artificial Intelligence

Information and communication technology

Internet of Things

Distributed Denial of Service

Intrusion detection system

Intrusion prevention system

Host-based intrusion detection systems

Network Intrusion Detection Systems

Signature-based intrusion detection system

Anomaly-based intrusion detection system

Li S, Da Xu L, Zhao S. The internet of things: a survey. Inform Syst Front. 2015;17(2):243–59.

Google Scholar  

Sun N, Zhang J, Rimba P, Gao S, Zhang LY, Xiang Y. Data-driven cybersecurity incident prediction: a survey. IEEE Commun Surv Tutor. 2018;21(2):1744–72.

McIntosh T, Jang-Jaccard J, Watters P, Susnjak T. The inadequacy of entropy-based ransomware detection. In: International conference on neural information processing. New York: Springer; 2019. p. 181–189

Alazab M, Venkatraman S, Watters P, Alazab M, et al. Zero-day malware detection based on supervised learning algorithms of api call signatures (2010)

Shaw A. Data breach: from notification to prevention using pci dss. Colum Soc Probs. 2009;43:517.

Gupta BB, Tewari A, Jain AK, Agrawal DP. Fighting against phishing attacks: state of the art and future challenges. Neural Comput Appl. 2017;28(12):3629–54.

Av-test institute, germany, https://www.av-test.org/en/statistics/malware/ . Accessed 20 Oct 2019.

Ibm security report, https://www.ibm.com/security/data-breach . Accessed on 20 Oct 2019.

Fischer EA. Cybersecurity issues and challenges: In brief. Congressional Research Service (2014)

Juniper research. https://www.juniperresearch.com/ . Accessed on 20 Oct 2019.

Papastergiou S, Mouratidis H, Kalogeraki E-M. Cyber security incident handling, warning and response system for the european critical information infrastructures (cybersane). In: International Conference on Engineering Applications of Neural Networks, p. 476–487 (2019). New York: Springer

Aftergood S. Cybersecurity: the cold war online. Nature. 2017;547(7661):30.

Hey AJ, Tansley S, Tolle KM, et al. The fourth paradigm: data-intensive scientific discovery. 2009;1:

Cukier K. Data, data everywhere: A special report on managing information, 2010.

Google trends. In: https://trends.google.com/trends/ , 2019.

Anwar S, Mohamad Zain J, Zolkipli MF, Inayat Z, Khan S, Anthony B, Chang V. From intrusion detection to an intrusion response system: fundamentals, requirements, and future directions. Algorithms. 2017;10(2):39.

MATH   Google Scholar  

Mohammadi S, Mirvaziri H, Ghazizadeh-Ahsaee M, Karimipour H. Cyber intrusion detection by combined feature selection algorithm. J Inform Sec Appl. 2019;44:80–8.

Tapiador JE, Orfila A, Ribagorda A, Ramos B. Key-recovery attacks on kids, a keyed anomaly detection system. IEEE Trans Depend Sec Comput. 2013;12(3):312–25.

Tavallaee M, Stakhanova N, Ghorbani AA. Toward credible evaluation of anomaly-based intrusion-detection methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 40(5), 516–524 (2010)

Foroughi F, Luksch P. Data science methodology for cybersecurity projects. arXiv preprint arXiv:1803.04219 , 2018.

Saxe J, Sanders H. Malware data science: Attack detection and attribution, 2018.

Rainie L, Anderson J, Connolly J. Cyber attacks likely to increase. Digital Life in. 2014, vol. 2025.

Fischer EA. Creating a national framework for cybersecurity: an analysis of issues and options. LIBRARY OF CONGRESS WASHINGTON DC CONGRESSIONAL RESEARCH SERVICE, 2005.

Craigen D, Diakun-Thibault N, Purse R. Defining cybersecurity. Technology Innovation. Manag Rev. 2014;4(10):13–21.

Council NR. et al. Toward a safer and more secure cyberspace, 2007.

Jang-Jaccard J, Nepal S. A survey of emerging threats in cybersecurity. J Comput Syst Sci. 2014;80(5):973–93.

MathSciNet   MATH   Google Scholar  

Mukkamala S, Sung A, Abraham A. Cyber security challenges: Designing efficient intrusion detection systems and antivirus tools. Vemuri, V. Rao, Enhancing Computer Security with Smart Technology.(Auerbach, 2006), 125–163, 2005.

Bilge L, Dumitraş T. Before we knew it: an empirical study of zero-day attacks in the real world. In: Proceedings of the 2012 ACM conference on computer and communications security. ACM; 2012. p. 833–44.

Davi L, Dmitrienko A, Sadeghi A-R, Winandy M. Privilege escalation attacks on android. In: International conference on information security. New York: Springer; 2010. p. 346–60.

Jovičić B, Simić D. Common web application attack types and security using asp .net. ComSIS, 2006.

Warkentin M, Willison R. Behavioral and policy issues in information systems security: the insider threat. Eur J Inform Syst. 2009;18(2):101–5.

Kügler D. “man in the middle” attacks on bluetooth. In: International Conference on Financial Cryptography. New York: Springer; 2003, p. 149–61.

Virvilis N, Gritzalis D. The big four-what we did wrong in advanced persistent threat detection. In: 2013 International Conference on Availability, Reliability and Security. IEEE; 2013. p. 248–54.

Boyd SW, Keromytis AD. Sqlrand: Preventing sql injection attacks. In: International conference on applied cryptography and network security. New York: Springer; 2004. p. 292–302.

Sigler K. Crypto-jacking: how cyber-criminals are exploiting the crypto-currency boom. Comput Fraud Sec. 2018;2018(9):12–4.

2019 data breach investigations report, https://enterprise.verizon.com/resources/reports/dbir/ . Accessed 20 Oct 2019.

Khraisat A, Gondal I, Vamplew P, Kamruzzaman J. Survey of intrusion detection systems: techniques, datasets and challenges. Cybersecurity. 2019;2(1):20.

Johnson L. Computer incident response and forensics team management: conducting a successful incident response, 2013.

Brahmi I, Brahmi H, Yahia SB. A multi-agents intrusion detection system using ontology and clustering techniques. In: IFIP international conference on computer science and its applications. New York: Springer; 2015. p. 381–93.

Qu X, Yang L, Guo K, Ma L, Sun M, Ke M, Li M. A survey on the development of self-organizing maps for unsupervised intrusion detection. In: Mobile networks and applications. 2019;1–22.

Liao H-J, Lin C-HR, Lin Y-C, Tung K-Y. Intrusion detection system: a comprehensive review. J Netw Comput Appl. 2013;36(1):16–24.

Alazab A, Hobbs M, Abawajy J, Alazab M. Using feature selection for intrusion detection system. In: 2012 International symposium on communications and information technologies (ISCIT). IEEE; 2012. p. 296–301.

Viegas E, Santin AO, Franca A, Jasinski R, Pedroni VA, Oliveira LS. Towards an energy-efficient anomaly-based intrusion detection engine for embedded systems. IEEE Trans Comput. 2016;66(1):163–77.

Xin Y, Kong L, Liu Z, Chen Y, Li Y, Zhu H, Gao M, Hou H, Wang C. Machine learning and deep learning methods for cybersecurity. IEEE Access. 2018;6:35365–81.

Dutt I, Borah S, Maitra IK, Bhowmik K, Maity A, Das S. Real-time hybrid intrusion detection system using machine learning techniques. 2018, p. 885–94.

Ragsdale DJ, Carver C, Humphries JW, Pooch UW. Adaptation techniques for intrusion detection and intrusion response systems. In: Smc 2000 conference proceedings. 2000 IEEE international conference on systems, man and cybernetics.’cybernetics evolving to systems, humans, organizations, and their complex interactions’(cat. No. 0). IEEE; 2000. vol. 4, p. 2344–2349.

Cao L. Data science: challenges and directions. Commun ACM. 2017;60(8):59–68.

Rizk A, Elragal A. Data science: developing theoretical contributions in information systems via text analytics. J Big Data. 2020;7(1):1–26.

Lippmann RP, Fried DJ, Graf I, Haines JW, Kendall KR, McClung D, Weber D, Webster SE, Wyschogrod D, Cunningham RK, et al. Evaluating intrusion detection systems: The 1998 darpa off-line intrusion detection evaluation. In: Proceedings DARPA information survivability conference and exposition. DISCEX’00. IEEE; 2000. vol. 2, p. 12–26.

Kdd cup 99. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html . Accessed 20 Oct 2019.

Tavallaee M, Bagheri E, Lu W, Ghorbani AA. A detailed analysis of the kdd cup 99 data set. In: 2009 IEEE symposium on computational intelligence for security and defense applications. IEEE; 2009. p. 1–6.

Caida ddos attack 2007 dataset. http://www.caida.org/data/ passive/ddos-20070804-dataset.xml/ . Accessed 20 Oct 2019.

Caida anonymized internet traces 2008 dataset. https://www.caida.org/data/passive/passive-2008-dataset . Accessed 20 Oct 2019.

Isot botnet dataset. https://www.uvic.ca/engineering/ece/isot/ datasets/index.php/ . Accessed 20 Oct 2019.

The honeynet project. http://www.honeynet.org/chapters/france/ . Accessed 20 Oct 2019.

Canadian institute of cybersecurity, university of new brunswick, iscx dataset, http://www.unb.ca/cic/datasets/index.html/ . Accessed 20 Oct 2019.

Shiravi A, Shiravi H, Tavallaee M, Ghorbani AA. Toward developing a systematic approach to generate benchmark datasets for intrusion detection. Comput Secur. 2012;31(3):357–74.

The ctu-13 dataset. https://stratosphereips.org/category/datasets-ctu13 . Accessed 20 Oct 2019.

Moustafa N, Slay J. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In: 2015 Military Communications and Information Systems Conference (MilCIS). IEEE; 2015. p. 1–6.

Cse-cic-ids2018 [online]. available: https://www.unb.ca/cic/ datasets/ids-2018.html/ . Accessed 20 Oct 2019.

Cic-ddos2019 [online]. available: https://www.unb.ca/cic/datasets/ddos-2019.html/ . Accessed 28 Mar 2019.

Jing X, Yan Z, Jiang X, Pedrycz W. Network traffic fusion and analysis against ddos flooding attacks with a novel reversible sketch. Inform Fusion. 2019;51:100–13.

Xie M, Hu J, Yu X, Chang E. Evaluating host-based anomaly detection systems: application of the frequency-based algorithms to adfa-ld. In: International conference on network and system security. New York: Springer; 2015. p. 542–49.

Lindauer B, Glasser J, Rosen M, Wallnau KC, ExactData L. Generating test data for insider threat detectors. JoWUA. 2014;5(2):80–94.

Glasser J, Lindauer B. Bridging the gap: A pragmatic approach to generating insider threat data. In: 2013 IEEE Security and Privacy Workshops. IEEE; 2013. p. 98–104.

Enronspam. https://labs-repos.iit.demokritos.gr/skel/i-config/downloads/enron-spam/ . Accessed 20 Oct 2019.

Spamassassin. http://www.spamassassin.org/publiccorpus/ . Accessed 20 Oct 2019.

Lingspam. https://labs-repos.iit.demokritos.gr/skel/i-config/downloads/lingspampublic.tar.gz/ . Accessed 20 Oct 2019.

Alexa top sites. https://aws.amazon.com/alexa-top-sites/ . Accessed 20 Oct 2019.

Bambenek consulting—master feeds. available online: http://osint.bambenekconsulting.com/feeds/ . Accessed 20 Oct 2019.

Dgarchive. https://dgarchive.caad.fkie.fraunhofer.de/site/ . Accessed 20 Oct 2019.

Zago M, Pérez MG, Pérez GM. Umudga: A dataset for profiling algorithmically generated domain names in botnet detection. Data in Brief. 2020;105400.

Zhou Y, Jiang X. Dissecting android malware: characterization and evolution. In: 2012 IEEE Symposium on security and privacy. IEEE; 2012. p. 95–109.

Virusshare. http://virusshare.com/ . Accessed 20 Oct 2019.

Virustotal. https://virustotal.com/ . Accessed 20 Oct 2019.

Comodo. https://www.comodo.com/home/internet-security/updates/vdp/database . Accessed 20 Oct 2019.

Contagio. http://contagiodump.blogspot.com/ . Accessed 20 Oct 2019.

Kumar R, Xiaosong Z, Khan RU, Kumar J, Ahad I. Effective and explainable detection of android malware based on machine learning algorithms. In: Proceedings of the 2018 international conference on computing and artificial intelligence. ACM; 2018. p. 35–40.

Microsoft malware classification (big 2015). arXiv:org/abs/1802.10135/ . Accessed 20 Oct 2019.

Koroniotis N, Moustafa N, Sitnikova E, Turnbull B. Towards the development of realistic botnet dataset in the internet of things for network forensic analytics: bot-iot dataset. Future Gen Comput Syst. 2019;100:779–96.

McIntosh TR, Jang-Jaccard J, Watters PA. Large scale behavioral analysis of ransomware attacks. In: International conference on neural information processing. New York: Springer; 2018. p. 217–29.

Han J, Pei J, Kamber M. Data mining: concepts and techniques, 2011.

Witten IH, Frank E. Data mining: Practical machine learning tools and techniques, 2005.

Dua S, Du X. Data mining and machine learning in cybersecurity, 2016.

Kotpalliwar MV, Wajgi R. Classification of attacks using support vector machine (svm) on kddcup’99 ids database. In: 2015 Fifth international conference on communication systems and network technologies. IEEE; 2015. p. 987–90.

Pervez MS, Farid DM. Feature selection and intrusion classification in nsl-kdd cup 99 dataset employing svms. In: The 8th international conference on software, knowledge, information management and applications (SKIMA 2014). IEEE; 2014. p. 1–6.

Yan M, Liu Z. A new method of transductive svm-based network intrusion detection. In: International conference on computer and computing technologies in agriculture. New York: Springer; 2010. p. 87–95.

Li Y, Xia J, Zhang S, Yan J, Ai X, Dai K. An efficient intrusion detection system based on support vector machines and gradually feature removal method. Expert Syst Appl. 2012;39(1):424–30.

Raman MG, Somu N, Jagarapu S, Manghnani T, Selvam T, Krithivasan K, Sriram VS. An efficient intrusion detection technique based on support vector machine and improved binary gravitational search algorithm. Artificial Intelligence Review. 2019, p. 1–32.

Kokila R, Selvi ST, Govindarajan K. Ddos detection and analysis in sdn-based environment using support vector machine classifier. In: 2014 Sixth international conference on advanced computing (ICoAC). IEEE; 2014. p. 205–10.

Xie M, Hu J, Slay J. Evaluating host-based anomaly detection systems: Application of the one-class svm algorithm to adfa-ld. In: 2014 11th international conference on fuzzy systems and knowledge discovery (FSKD). IEEE; 2014. p. 978–82.

Saxena H, Richariya V. Intrusion detection in kdd99 dataset using svm-pso and feature reduction with information gain. Int J Comput Appl. 2014;98:6.

Chandrasekhar A, Raghuveer K. Confederation of fcm clustering, ann and svm techniques to implement hybrid nids using corrected kdd cup 99 dataset. In: 2014 international conference on communication and signal processing. IEEE; 2014. p. 672–76.

Shapoorifard H, Shamsinejad P. Intrusion detection using a novel hybrid method incorporating an improved knn. Int J Comput Appl. 2017;173(1):5–9.

Vishwakarma S, Sharma V, Tiwari A. An intrusion detection system using knn-aco algorithm. Int J Comput Appl. 2017;171(10):18–23.

Meng W, Li W, Kwok L-F. Design of intelligent knn-based alarm filter using knowledge-based alert verification in intrusion detection. Secur Commun Netw. 2015;8(18):3883–95.

Dada E. A hybridized svm-knn-pdapso approach to intrusion detection system. In: Proc. Fac. Seminar Ser., 2017, p. 14–21.

Sharifi AM, Amirgholipour SK, Pourebrahimi A. Intrusion detection based on joint of k-means and knn. J Converg Inform Technol. 2015;10(5):42.

Lin W-C, Ke S-W, Tsai C-F. Cann: an intrusion detection system based on combining cluster centers and nearest neighbors. Knowl Based Syst. 2015;78:13–21.

Koc L, Mazzuchi TA, Sarkani S. A network intrusion detection system based on a hidden naïve bayes multiclass classifier. Exp Syst Appl. 2012;39(18):13492–500.

Moon D, Im H, Kim I, Park JH. Dtb-ids: an intrusion detection system based on decision tree using behavior analysis for preventing apt attacks. J Supercomput. 2017;73(7):2881–95.

Ingre, B., Yadav, A., Soni, A.K.: Decision tree based intrusion detection system for nsl-kdd dataset. In: International conference on information and communication technology for intelligent systems. New York: Springer; 2017. p. 207–18.

Malik AJ, Khan FA. A hybrid technique using binary particle swarm optimization and decision tree pruning for network intrusion detection. Cluster Comput. 2018;21(1):667–80.

Relan NG, Patil DR. Implementation of network intrusion detection system using variant of decision tree algorithm. In: 2015 international conference on nascent technologies in the engineering field (ICNTE). IEEE; 2015. p. 1–5.

Rai K, Devi MS, Guleria A. Decision tree based algorithm for intrusion detection. Int J Adv Netw Appl. 2016;7(4):2828.

Sarker IH, Abushark YB, Alsolami F, Khan AI. Intrudtree: a machine learning based cyber security intrusion detection model. Symmetry. 2020;12(5):754.

Puthran S, Shah K. Intrusion detection using improved decision tree algorithm with binary and quad split. In: International symposium on security in computing and communication. New York: Springer; 2016. p. 427–438.

Balogun AO, Jimoh RG. Anomaly intrusion detection using an hybrid of decision tree and k-nearest neighbor, 2015.

Azad C, Jha VK. Genetic algorithm to solve the problem of small disjunct in the decision tree based intrusion detection system. Int J Comput Netw Inform Secur. 2015;7(8):56.

Jo S, Sung H, Ahn B. A comparative study on the performance of intrusion detection using decision tree and artificial neural network models. J Korea Soc Dig Indus Inform Manag. 2015;11(4):33–45.

Zhan J, Zulkernine M, Haque A. Random-forests-based network intrusion detection systems. IEEE Trans Syst Man Cybern C. 2008;38(5):649–59.

Tajbakhsh A, Rahmati M, Mirzaei A. Intrusion detection using fuzzy association rules. Appl Soft Comput. 2009;9(2):462–9.

Mitchell R, Chen R. Behavior rule specification-based intrusion detection for safety critical medical cyber physical systems. IEEE Trans Depend Secure Comput. 2014;12(1):16–30.

Alazab M, Venkataraman S, Watters P. Towards understanding malware behaviour by the extraction of api calls. In: 2010 second cybercrime and trustworthy computing Workshop. IEEE; 2010. p. 52–59.

Yuan Y, Kaklamanos G, Hogrefe D. A novel semi-supervised adaboost technique for network anomaly detection. In: Proceedings of the 19th ACM international conference on modeling, analysis and simulation of wireless and mobile systems. ACM; 2016. p. 111–14.

Ariu D, Tronci R, Giacinto G. Hmmpayl: an intrusion detection system based on hidden markov models. Comput Secur. 2011;30(4):221–41.

Årnes A, Valeur F, Vigna G, Kemmerer RA. Using hidden markov models to evaluate the risks of intrusions. In: International workshop on recent advances in intrusion detection. New York: Springer; 2006. p. 145–64.

Hansen JV, Lowry PB, Meservy RD, McDonald DM. Genetic programming for prevention of cyberterrorism through dynamic and evolving intrusion detection. Decis Supp Syst. 2007;43(4):1362–74.

Aslahi-Shahri B, Rahmani R, Chizari M, Maralani A, Eslami M, Golkar MJ, Ebrahimi A. A hybrid method consisting of ga and svm for intrusion detection system. Neural Comput Appl. 2016;27(6):1669–76.

Alrawashdeh K, Purdy C. Toward an online anomaly intrusion detection system based on deep learning. In: 2016 15th IEEE international conference on machine learning and applications (ICMLA). IEEE; 2016. p. 195–200.

Yin C, Zhu Y, Fei J, He X. A deep learning approach for intrusion detection using recurrent neural networks. IEEE Access. 2017;5:21954–61.

Kim J, Kim J, Thu HLT, Kim H. Long short term memory recurrent neural network classifier for intrusion detection. In: 2016 international conference on platform technology and service (PlatCon). IEEE; 2016. p. 1–5.

Almiani M, AbuGhazleh A, Al-Rahayfeh A, Atiewi S, Razaque A. Deep recurrent neural network for iot intrusion detection system. Simulation Modelling Practice and Theory. 2019;102031.

Kolosnjaji B, Zarras A, Webster G, Eckert C. Deep learning for classification of malware system call sequences. In: Australasian joint conference on artificial intelligence. New York: Springer; 2016. p. 137–49.

Wang W, Zhu M, Zeng X, Ye X, Sheng Y. Malware traffic classification using convolutional neural network for representation learning. In: 2017 international conference on information networking (ICOIN). IEEE; 2017. p. 712–17.

Alauthman M, Aslam N, Al-kasassbeh M, Khan S, Al-Qerem A, Choo K-KR. An efficient reinforcement learning-based botnet detection approach. J Netw Comput Appl. 2020;150:102479.

Blanco R, Cilla JJ, Briongos S, Malagón P, Moya JM. Applying cost-sensitive classifiers with reinforcement learning to ids. In: International conference on intelligent data engineering and automated learning. New York: Springer; 2018. p. 531–38.

Lopez-Martin M, Carro B, Sanchez-Esguevillas A. Application of deep reinforcement learning to intrusion detection for supervised problems. Exp Syst Appl. 2020;141:112963.

Sarker IH, Kayes A, Watters P. Effectiveness analysis of machine learning classification models for predicting personalized context-aware smartphone usage. J Big Data. 2019;6(1):1–28.

Holte RC. Very simple classification rules perform well on most commonly used datasets. Mach Learn. 1993;11(1):63–90.

John GH, Langley P. Estimating continuous distributions in bayesian classifiers. In: Proceedings of the eleventh conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc.; 1995. p. 338–45.

Quinlan JR. C4.5: Programs for machine learning. Machine Learning, 1993.

Sarker IH, Colman A, Han J, Khan AI, Abushark YB, Salah K. Behavdt: a behavioral decision tree learning to build user-centric context-aware predictive model. Mobile Networks and Applications. 2019, p. 1–11.

Aha DW, Kibler D, Albert MK. Instance-based learning algorithms. Mach Learn. 1991;6(1):37–66.

Keerthi SS, Shevade SK, Bhattacharyya C, Murthy KRK. Improvements to platt’s smo algorithm for svm classifier design. Neural Comput. 2001;13(3):637–49.

Freund Y, Schapire RE, et al: Experiments with a new boosting algorithm. In: Icml, vol. 96, p. 148–156 (1996). Citeseer

Le Cessie S, Van Houwelingen JC. Ridge estimators in logistic regression. J Royal Stat Soc C. 1992;41(1):191–201.

Watters PA, McCombie S, Layton R, Pieprzyk J. Characterising and predicting cyber attacks using the cyber attacker model profile (camp). J Money Launder Control. 2012.

Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

Sarker IH. Context-aware rule learning from smartphone data: survey, challenges and future directions. J Big Data. 2019;6(1):95.

MacQueen J. Some methods for classification and analysis of multivariate observations. In: Fifth Berkeley symposium on mathematical statistics and probability, vol. 1, 1967.

Rokach L. A survey of clustering algorithms. In: Data Mining and Knowledge Discovery Handbook. New York: Springer; 2010. p. 269–98.

Sneath PH. The application of computers to taxonomy. J Gen Microbiol. 1957;17:1.

Sorensen T. method of establishing groups of equal amplitude in plant sociology based on similarity of species. Biol Skr. 1948;5.

Sarker IH, Colman A, Kabir MA, Han J. Individualized time-series segmentation for mining mobile phone user behavior. Comput J. 2018;61(3):349–68.

Kim G, Lee S, Kim S. A novel hybrid intrusion detection method integrating anomaly detection with misuse detection. Exp Syst Appl. 2014;41(4):1690–700.

MathSciNet   Google Scholar  

Agrawal R, Imieliński T, Swami A. Mining association rules between sets of items in large databases. In: ACM SIGMOD Record. ACM; 1993. vol. 22, p. 207–16.

Flach PA, Lachiche N. Confirmation-guided discovery of first-order rules with tertius. Mach Learn. 2001;42(1–2):61–95.

Agrawal R, Srikant R, et al: Fast algorithms for mining association rules. In: Proc. 20th Int. Conf. Very Large Data Bases, VLDB, 1994, vol. 1215, p. 487–99.

Houtsma M, Swami A. Set-oriented mining for association rules in relational databases. In: Proceedings of the eleventh international conference on data engineering. IEEE; 1995. p. 25–33.

Ma BLWHY. Integrating classification and association rule mining. In: Proceedings of the fourth international conference on knowledge discovery and data mining, 1998.

Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation. In: ACM Sigmod Record. ACM; 2000. vol. 29, p. 1–12.

Sarker IH, Salim FD. Mining user behavioral rules from smartphone data through association analysis. In: Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Melbourne, Australia. New York: Springer; 2018. p. 450–61.

Das A, Ng W-K, Woon Y-K. Rapid association rule mining. In: Proceedings of the tenth international conference on information and knowledge management. ACM; 2001. p. 474–81.

Zaki MJ. Scalable algorithms for association mining. IEEE Trans Knowl Data Eng. 2000;12(3):372–90.

Coelho IM, Coelho VN, Luz EJS, Ochi LS, Guimarães FG, Rios E. A gpu deep learning metaheuristic based model for time series forecasting. Appl Energy. 2017;201:412–8.

Van Efferen L, Ali-Eldin AM. A multi-layer perceptron approach for flow-based anomaly detection. In: 2017 International symposium on networks, computers and communications (ISNCC). IEEE; 2017. p. 1–6.

Liu H, Lang B, Liu M, Yan H. Cnn and rnn based payload classification methods for attack detection. Knowl Based Syst. 2019;163:332–41.

Berman DS, Buczak AL, Chavis JS, Corbett CL. A survey of deep learning methods for cyber security. Information. 2019;10(4):122.

Bellman R. A markovian decision process. J Math Mech. 1957;1:679–84.

Kaelbling LP, Littman ML, Moore AW. Reinforcement learning: a survey. J Artif Intell Res. 1996;4:237–85.

Sarker IH. A machine learning based robust prediction model for real-life mobile phone data. Internet of Things. 2019;5:180–93.

Kayes ASM, Han J, Colman A. OntCAAC: an ontology-based approach to context-aware access control for software services. Comput J. 2015;58(11):3000–34.

Kayes ASM, Rahayu W, Dillon T. An ontology-based approach to dynamic contextual role for pervasive access control. In: AINA 2018. IEEE Computer Society, 2018.

Colombo P, Ferrari E. Access control technologies for big data management systems: literature review and future trends. Cybersecurity. 2019;2(1):1–13.

Aleroud A, Karabatis G. Contextual information fusion for intrusion detection: a survey and taxonomy. Knowl Inform Syst. 2017;52(3):563–619.

Sarker IH, Abushark YB, Khan AI. Contextpca: Predicting context-aware smartphone apps usage based on machine learning techniques. Symmetry. 2020;12(4):499.

Madsen RE, Hansen LK, Winther O. Singular value decomposition and principal component analysis. Neural Netw. 2004;1:1–5.

Qiao L-B, Zhang B-F, Lai Z-Q, Su J-S. Mining of attack models in ids alerts from network backbone by a two-stage clustering method. In: 2012 IEEE 26th international parallel and distributed processing symposium workshops & Phd Forum. IEEE; 2012. p. 1263–9.

Sarker IH, Colman A, Han J. Recencyminer: mining recency-based personalized behavior from contextual smartphone data. J Big Data. 2019;6(1):49.

Ullah F, Babar MA. Architectural tactics for big data cybersecurity analytics systems: a review. J Syst Softw. 2019;151:81–118.

Zhao S, Leftwich K, Owens M, Magrone F, Schonemann J, Anderson B, Medhi D. I-can-mama: Integrated campus network monitoring and management. In: 2014 IEEE network operations and management symposium (NOMS). IEEE; 2014. p. 1–7.

Abomhara M, et al. Cyber security and the internet of things: vulnerabilities, threats, intruders and attacks. J Cyber Secur Mob. 2015;4(1):65–88.

Helali RGM. Data mining based network intrusion detection system: A survey. In: Novel algorithms and techniques in telecommunications and networking. New York: Springer; 2010. p. 501–505.

Ryoo J, Rizvi S, Aiken W, Kissell J. Cloud security auditing: challenges and emerging approaches. IEEE Secur Priv. 2013;12(6):68–74.

Densham B. Three cyber-security strategies to mitigate the impact of a data breach. Netw Secur. 2015;2015(1):5–8.

Salah K, Rehman MHU, Nizamuddin N, Al-Fuqaha A. Blockchain for ai: review and open research challenges. IEEE Access. 2019;7:10127–49.

Gandomi A, Haider M. Beyond the hype: big data concepts, methods, and analytics. Int J Inform Manag. 2015;35(2):137–44.

Golchha N. Big data-the information revolution. Int J Adv Res. 2015;1(12):791–4.

Hariri RH, Fredericks EM, Bowers KM. Uncertainty in big data analytics: survey, opportunities, and challenges. J Big Data. 2019;6(1):44.

Tsai C-W, Lai C-F, Chao H-C, Vasilakos AV. Big data analytics: a survey. J Big data. 2015;2(1):21.

Download references

Acknowledgements

The authors would like to thank all the reviewers for their rigorous review and comments in several revision rounds. The reviews are detailed and helpful to improve and finalize the manuscript. The authors are highly grateful to them.

Author information

Authors and affiliations.

Swinburne University of Technology, Melbourne, VIC, 3122, Australia

Iqbal H. Sarker

Chittagong University of Engineering and Technology, Chittagong, 4349, Bangladesh

La Trobe University, Melbourne, VIC, 3086, Australia

A. S. M. Kayes, Paul Watters & Alex Ng

University of Nevada, Reno, USA

Shahriar Badsha

Macquarie University, Sydney, NSW, 2109, Australia

Hamed Alqahtani

You can also search for this author in PubMed   Google Scholar

Contributions

This article provides not only a discussion on cybersecurity data science and relevant methods but also to discuss the applicability towards data-driven intelligent decision making in cybersecurity systems and services. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Iqbal H. Sarker .

Ethics declarations

Competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Cite this article.

Sarker, I.H., Kayes, A.S.M., Badsha, S. et al. Cybersecurity data science: an overview from machine learning perspective. J Big Data 7 , 41 (2020). https://doi.org/10.1186/s40537-020-00318-5

Download citation

Received : 26 October 2019

Accepted : 21 June 2020

Published : 01 July 2020

DOI : https://doi.org/10.1186/s40537-020-00318-5

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Decision making
  • Cyber-attack
  • Security modeling
  • Intrusion detection
  • Cyber threat intelligence

research paper on security in big data

Help | Advanced Search

Computer Science > Cryptography and Security

Title: large language models for cyber security: a systematic literature review.

Abstract: The rapid advancement of Large Language Models (LLMs) has opened up new opportunities for leveraging artificial intelligence in various domains, including cybersecurity. As the volume and sophistication of cyber threats continue to grow, there is an increasing need for intelligent systems that can automatically detect vulnerabilities, analyze malware, and respond to attacks. In this survey, we conduct a comprehensive review of the literature on the application of LLMs in cybersecurity (LLM4Security). By comprehensively collecting over 30K relevant papers and systematically analyzing 127 papers from top security and software engineering venues, we aim to provide a holistic view of how LLMs are being used to solve diverse problems across the cybersecurity domain. Through our analysis, we identify several key findings. First, we observe that LLMs are being applied to a wide range of cybersecurity tasks, including vulnerability detection, malware analysis, network intrusion detection, and phishing detection. Second, we find that the datasets used for training and evaluating LLMs in these tasks are often limited in size and diversity, highlighting the need for more comprehensive and representative datasets. Third, we identify several promising techniques for adapting LLMs to specific cybersecurity domains, such as fine-tuning, transfer learning, and domain-specific pre-training. Finally, we discuss the main challenges and opportunities for future research in LLM4Security, including the need for more interpretable and explainable models, the importance of addressing data privacy and security concerns, and the potential for leveraging LLMs for proactive defense and threat hunting. Overall, our survey provides a comprehensive overview of the current state-of-the-art in LLM4Security and identifies several promising directions for future research.

Submission history

Access paper:.

  • Other Formats

References & Citations

  • Google Scholar
  • Semantic Scholar

BibTeX formatted citation

BibSonomy logo

Bibliographic and Citation Tools

Code, data and media associated with this article, recommenders and search tools.

  • Institution

arXivLabs: experimental projects with community collaborators

arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.

Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.

Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs .

Cyber risk and cybersecurity: a systematic review of data availability

  • Open access
  • Published: 17 February 2022
  • Volume 47 , pages 698–736, ( 2022 )

Cite this article

You have full access to this open access article

research paper on security in big data

  • Frank Cremer 1 ,
  • Barry Sheehan   ORCID: orcid.org/0000-0003-4592-7558 1 ,
  • Michael Fortmann 2 ,
  • Arash N. Kia 1 ,
  • Martin Mullins 1 ,
  • Finbarr Murphy 1 &
  • Stefan Materne 2  

66k Accesses

63 Citations

42 Altmetric

Explore all metrics

Cybercrime is estimated to have cost the global economy just under USD 1 trillion in 2020, indicating an increase of more than 50% since 2018. With the average cyber insurance claim rising from USD 145,000 in 2019 to USD 359,000 in 2020, there is a growing necessity for better cyber information sources, standardised databases, mandatory reporting and public awareness. This research analyses the extant academic and industry literature on cybersecurity and cyber risk management with a particular focus on data availability. From a preliminary search resulting in 5219 cyber peer-reviewed studies, the application of the systematic methodology resulted in 79 unique datasets. We posit that the lack of available data on cyber risk poses a serious problem for stakeholders seeking to tackle this issue. In particular, we identify a lacuna in open databases that undermine collective endeavours to better manage this set of risks. The resulting data evaluation and categorisation will support cybersecurity researchers and the insurance industry in their efforts to comprehend, metricise and manage cyber risks.

Similar content being viewed by others

research paper on security in big data

The Ethical Implications of Using Artificial Intelligence in Auditing

research paper on security in big data

Cyber Security Threats and Vulnerabilities: A Systematic Mapping Study

research paper on security in big data

Artificial Intelligence and Fraud Detection

Avoid common mistakes on your manuscript.

Introduction

Globalisation, digitalisation and smart technologies have escalated the propensity and severity of cybercrime. Whilst it is an emerging field of research and industry, the importance of robust cybersecurity defence systems has been highlighted at the corporate, national and supranational levels. The impacts of inadequate cybersecurity are estimated to have cost the global economy USD 945 billion in 2020 (Maleks Smith et al. 2020 ). Cyber vulnerabilities pose significant corporate risks, including business interruption, breach of privacy and financial losses (Sheehan et al. 2019 ). Despite the increasing relevance for the international economy, the availability of data on cyber risks remains limited. The reasons for this are many. Firstly, it is an emerging and evolving risk; therefore, historical data sources are limited (Biener et al. 2015 ). It could also be due to the fact that, in general, institutions that have been hacked do not publish the incidents (Eling and Schnell 2016 ). The lack of data poses challenges for many areas, such as research, risk management and cybersecurity (Falco et al. 2019 ). The importance of this topic is demonstrated by the announcement of the European Council in April 2021 that a centre of excellence for cybersecurity will be established to pool investments in research, technology and industrial development. The goal of this centre is to increase the security of the internet and other critical network and information systems (European Council 2021 ).

This research takes a risk management perspective, focusing on cyber risk and considering the role of cybersecurity and cyber insurance in risk mitigation and risk transfer. The study reviews the existing literature and open data sources related to cybersecurity and cyber risk. This is the first systematic review of data availability in the general context of cyber risk and cybersecurity. By identifying and critically analysing the available datasets, this paper supports the research community by aggregating, summarising and categorising all available open datasets. In addition, further information on datasets is attached to provide deeper insights and support stakeholders engaged in cyber risk control and cybersecurity. Finally, this research paper highlights the need for open access to cyber-specific data, without price or permission barriers.

The identified open data can support cyber insurers in their efforts on sustainable product development. To date, traditional risk assessment methods have been untenable for insurance companies due to the absence of historical claims data (Sheehan et al. 2021 ). These high levels of uncertainty mean that cyber insurers are more inclined to overprice cyber risk cover (Kshetri 2018 ). Combining external data with insurance portfolio data therefore seems to be essential to improve the evaluation of the risk and thus lead to risk-adjusted pricing (Bessy-Roland et al. 2021 ). This argument is also supported by the fact that some re/insurers reported that they are working to improve their cyber pricing models (e.g. by creating or purchasing databases from external providers) (EIOPA 2018 ). Figure  1 provides an overview of pricing tools and factors considered in the estimation of cyber insurance based on the findings of EIOPA ( 2018 ) and the research of Romanosky et al. ( 2019 ). The term cyber risk refers to all cyber risks and their potential impact.

figure 1

An overview of the current cyber insurance informational and methodological landscape, adapted from EIOPA ( 2018 ) and Romanosky et al. ( 2019 )

Besides the advantage of risk-adjusted pricing, the availability of open datasets helps companies benchmark their internal cyber posture and cybersecurity measures. The research can also help to improve risk awareness and corporate behaviour. Many companies still underestimate their cyber risk (Leong and Chen 2020 ). For policymakers, this research offers starting points for a comprehensive recording of cyber risks. Although in many countries, companies are obliged to report data breaches to the respective supervisory authority, this information is usually not accessible to the research community. Furthermore, the economic impact of these breaches is usually unclear.

As well as the cyber risk management community, this research also supports cybersecurity stakeholders. Researchers are provided with an up-to-date, peer-reviewed literature of available datasets showing where these datasets have been used. For example, this includes datasets that have been used to evaluate the effectiveness of countermeasures in simulated cyberattacks or to test intrusion detection systems. This reduces a time-consuming search for suitable datasets and ensures a comprehensive review of those available. Through the dataset descriptions, researchers and industry stakeholders can compare and select the most suitable datasets for their purposes. In addition, it is possible to combine the datasets from one source in the context of cybersecurity or cyber risk. This supports efficient and timely progress in cyber risk research and is beneficial given the dynamic nature of cyber risks.

Cyber risks are defined as “operational risks to information and technology assets that have consequences affecting the confidentiality, availability, and/or integrity of information or information systems” (Cebula et al. 2014 ). Prominent cyber risk events include data breaches and cyberattacks (Agrafiotis et al. 2018 ). The increasing exposure and potential impact of cyber risk have been highlighted in recent industry reports (e.g. Allianz 2021 ; World Economic Forum 2020 ). Cyberattacks on critical infrastructures are ranked 5th in the World Economic Forum's Global Risk Report. Ransomware, malware and distributed denial-of-service (DDoS) are examples of the evolving modes of a cyberattack. One example is the ransomware attack on the Colonial Pipeline, which shut down the 5500 mile pipeline system that delivers 2.5 million barrels of fuel per day and critical liquid fuel infrastructure from oil refineries to states along the U.S. East Coast (Brower and McCormick 2021 ). These and other cyber incidents have led the U.S. to strengthen its cybersecurity and introduce, among other things, a public body to analyse major cyber incidents and make recommendations to prevent a recurrence (Murphey 2021a ). Another example of the scope of cyberattacks is the ransomware NotPetya in 2017. The damage amounted to USD 10 billion, as the ransomware exploited a vulnerability in the windows system, allowing it to spread independently worldwide in the network (GAO 2021 ). In the same year, the ransomware WannaCry was launched by cybercriminals. The cyberattack on Windows software took user data hostage in exchange for Bitcoin cryptocurrency (Smart 2018 ). The victims included the National Health Service in Great Britain. As a result, ambulances were redirected to other hospitals because of information technology (IT) systems failing, leaving people in need of urgent assistance waiting. It has been estimated that 19,000 cancelled treatment appointments resulted from losses of GBP 92 million (Field 2018 ). Throughout the COVID-19 pandemic, ransomware attacks increased significantly, as working from home arrangements increased vulnerability (Murphey 2021b ).

Besides cyberattacks, data breaches can also cause high costs. Under the General Data Protection Regulation (GDPR), companies are obliged to protect personal data and safeguard the data protection rights of all individuals in the EU area. The GDPR allows data protection authorities in each country to impose sanctions and fines on organisations they find in breach. “For data breaches, the maximum fine can be €20 million or 4% of global turnover, whichever is higher” (GDPR.EU 2021 ). Data breaches often involve a large amount of sensitive data that has been accessed, unauthorised, by external parties, and are therefore considered important for information security due to their far-reaching impact (Goode et al. 2017 ). A data breach is defined as a “security incident in which sensitive, protected, or confidential data are copied, transmitted, viewed, stolen, or used by an unauthorized individual” (Freeha et al. 2021 ). Depending on the amount of data, the extent of the damage caused by a data breach can be significant, with the average cost being USD 392 million Footnote 1 (IBM Security 2020 ).

This research paper reviews the existing literature and open data sources related to cybersecurity and cyber risk, focusing on the datasets used to improve academic understanding and advance the current state-of-the-art in cybersecurity. Furthermore, important information about the available datasets is presented (e.g. use cases), and a plea is made for open data and the standardisation of cyber risk data for academic comparability and replication. The remainder of the paper is structured as follows. The next section describes the related work regarding cybersecurity and cyber risks. The third section outlines the review method used in this work and the process. The fourth section details the results of the identified literature. Further discussion is presented in the penultimate section and the final section concludes.

Related work

Due to the significance of cyber risks, several literature reviews have been conducted in this field. Eling ( 2020 ) reviewed the existing academic literature on the topic of cyber risk and cyber insurance from an economic perspective. A total of 217 papers with the term ‘cyber risk’ were identified and classified in different categories. As a result, open research questions are identified, showing that research on cyber risks is still in its infancy because of their dynamic and emerging nature. Furthermore, the author highlights that particular focus should be placed on the exchange of information between public and private actors. An improved information flow could help to measure the risk more accurately and thus make cyber risks more insurable and help risk managers to determine the right level of cyber risk for their company. In the context of cyber insurance data, Romanosky et al. ( 2019 ) analysed the underwriting process for cyber insurance and revealed how cyber insurers understand and assess cyber risks. For this research, they examined 235 American cyber insurance policies that were publicly available and looked at three components (coverage, application questionnaires and pricing). The authors state in their findings that many of the insurers used very simple, flat-rate pricing (based on a single calculation of expected loss), while others used more parameters such as the asset value of the company (or company revenue) or standard insurance metrics (e.g. deductible, limits), and the industry in the calculation. This is in keeping with Eling ( 2020 ), who states that an increased amount of data could help to make cyber risk more accurately measured and thus more insurable. Similar research on cyber insurance and data was conducted by Nurse et al. ( 2020 ). The authors examined cyber insurance practitioners' perceptions and the challenges they face in collecting and using data. In addition, gaps were identified during the research where further data is needed. The authors concluded that cyber insurance is still in its infancy, and there are still several unanswered questions (for example, cyber valuation, risk calculation and recovery). They also pointed out that a better understanding of data collection and use in cyber insurance would be invaluable for future research and practice. Bessy-Roland et al. ( 2021 ) come to a similar conclusion. They proposed a multivariate Hawkes framework to model and predict the frequency of cyberattacks. They used a public dataset with characteristics of data breaches affecting the U.S. industry. In the conclusion, the authors make the argument that an insurer has a better knowledge of cyber losses, but that it is based on a small dataset and therefore combination with external data sources seems essential to improve the assessment of cyber risks.

Several systematic reviews have been published in the area of cybersecurity (Kruse et al. 2017 ; Lee et al. 2020 ; Loukas et al. 2013 ; Ulven and Wangen 2021 ). In these papers, the authors concentrated on a specific area or sector in the context of cybersecurity. This paper adds to this extant literature by focusing on data availability and its importance to risk management and insurance stakeholders. With a priority on healthcare and cybersecurity, Kruse et al. ( 2017 ) conducted a systematic literature review. The authors identified 472 articles with the keywords ‘cybersecurity and healthcare’ or ‘ransomware’ in the databases Cumulative Index of Nursing and Allied Health Literature, PubMed and Proquest. Articles were eligible for this review if they satisfied three criteria: (1) they were published between 2006 and 2016, (2) the full-text version of the article was available, and (3) the publication is a peer-reviewed or scholarly journal. The authors found that technological development and federal policies (in the U.S.) are the main factors exposing the health sector to cyber risks. Loukas et al. ( 2013 ) conducted a review with a focus on cyber risks and cybersecurity in emergency management. The authors provided an overview of cyber risks in communication, sensor, information management and vehicle technologies used in emergency management and showed areas for which there is still no solution in the literature. Similarly, Ulven and Wangen ( 2021 ) reviewed the literature on cybersecurity risks in higher education institutions. For the literature review, the authors used the keywords ‘cyber’, ‘information threats’ or ‘vulnerability’ in connection with the terms ‘higher education, ‘university’ or ‘academia’. A similar literature review with a focus on Internet of Things (IoT) cybersecurity was conducted by Lee et al. ( 2020 ). The review revealed that qualitative approaches focus on high-level frameworks, and quantitative approaches to cybersecurity risk management focus on risk assessment and quantification of cyberattacks and impacts. In addition, the findings presented a four-step IoT cyber risk management framework that identifies, quantifies and prioritises cyber risks.

Datasets are an essential part of cybersecurity research, underlined by the following works. Ilhan Firat et al. ( 2021 ) examined various cybersecurity datasets in detail. The study was motivated by the fact that with the proliferation of the internet and smart technologies, the mode of cyberattacks is also evolving. However, in order to prevent such attacks, they must first be detected; the dissemination and further development of cybersecurity datasets is therefore critical. In their work, the authors observed studies of datasets used in intrusion detection systems. Khraisat et al. ( 2019 ) also identified a need for new datasets in the context of cybersecurity. The researchers presented a taxonomy of current intrusion detection systems, a comprehensive review of notable recent work, and an overview of the datasets commonly used for assessment purposes. In their conclusion, the authors noted that new datasets are needed because most machine-learning techniques are trained and evaluated on the knowledge of old datasets. These datasets do not contain new and comprehensive information and are partly derived from datasets from 1999. The authors noted that the core of this issue is the availability of new public datasets as well as their quality. The availability of data, how it is used, created and shared was also investigated by Zheng et al. ( 2018 ). The researchers analysed 965 cybersecurity research papers published between 2012 and 2016. They created a taxonomy of the types of data that are created and shared and then analysed the data collected via datasets. The researchers concluded that while datasets are recognised as valuable for cybersecurity research, the proportion of publicly available datasets is limited.

The main contributions of this review and what differentiates it from previous studies can be summarised as follows. First, as far as we can tell, it is the first work to summarise all available datasets on cyber risk and cybersecurity in the context of a systematic review and present them to the scientific community and cyber insurance and cybersecurity stakeholders. Second, we investigated, analysed, and made available the datasets to support efficient and timely progress in cyber risk research. And third, we enable comparability of datasets so that the appropriate dataset can be selected depending on the research area.

Methodology

Process and eligibility criteria.

The structure of this systematic review is inspired by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) framework (Page et al. 2021 ), and the search was conducted from 3 to 10 May 2021. Due to the continuous development of cyber risks and their countermeasures, only articles published in the last 10 years were considered. In addition, only articles published in peer-reviewed journals written in English were included. As a final criterion, only articles that make use of one or more cybersecurity or cyber risk datasets met the inclusion criteria. Specifically, these studies presented new or existing datasets, used them for methods, or used them to verify new results, as well as analysed them in an economic context and pointed out their effects. The criterion was fulfilled if it was clearly stated in the abstract that one or more datasets were used. A detailed explanation of this selection criterion can be found in the ‘Study selection’ section.

Information sources

In order to cover a complete spectrum of literature, various databases were queried to collect relevant literature on the topic of cybersecurity and cyber risks. Due to the spread of related articles across multiple databases, the literature search was limited to the following four databases for simplicity: IEEE Xplore, Scopus, SpringerLink and Web of Science. This is similar to other literature reviews addressing cyber risks or cybersecurity, including Sardi et al. ( 2021 ), Franke and Brynielsson ( 2014 ), Lagerström (2019), Eling and Schnell ( 2016 ) and Eling ( 2020 ). In this paper, all databases used in the aforementioned works were considered. However, only two studies also used all the databases listed. The IEEE Xplore database contains electrical engineering, computer science, and electronics work from over 200 journals and three million conference papers (IEEE 2021 ). Scopus includes 23,400 peer-reviewed journals from more than 5000 international publishers in the areas of science, engineering, medicine, social sciences and humanities (Scopus 2021 ). SpringerLink contains 3742 journals and indexes over 10 million scientific documents (SpringerLink 2021 ). Finally, Web of Science indexes over 9200 journals in different scientific disciplines (Science 2021 ).

A search string was created and applied to all databases. To make the search efficient and reproducible, the following search string with Boolean operator was used in all databases: cybersecurity OR cyber risk AND dataset OR database. To ensure uniformity of the search across all databases, some adjustments had to be made for the respective search engines. In Scopus, for example, the Advanced Search was used, and the field code ‘Title-ABS-KEY’ was integrated into the search string. For IEEE Xplore, the search was carried out with the Search String in the Command Search and ‘All Metadata’. In the Web of Science database, the Advanced Search was used. The special feature of this search was that it had to be carried out in individual steps. The first search was carried out with the terms cybersecurity OR cyber risk with the field tag Topic (T.S. =) and the second search with dataset OR database. Subsequently, these searches were combined, which then delivered the searched articles for review. For SpringerLink, the search string was used in the Advanced Search under the category ‘Find the resources with all of the words’. After conducting this search string, 5219 studies could be found. According to the eligibility criteria (period, language and only scientific journals), 1581 studies were identified in the databases:

Scopus: 135

Springer Link: 548

Web of Science: 534

An overview of the process is given in Fig.  2 . Combined with the results from the four databases, 854 articles without duplicates were identified.

figure 2

Literature search process and categorisation of the studies

Study selection

In the final step of the selection process, the articles were screened for relevance. Due to a large number of results, the abstracts were analysed in the first step of the process. The aim was to determine whether the article was relevant for the systematic review. An article fulfilled the criterion if it was recognisable in the abstract that it had made a contribution to datasets or databases with regard to cyber risks or cybersecurity. Specifically, the criterion was considered to be met if the abstract used datasets that address the causes or impacts of cyber risks, and measures in the area of cybersecurity. In this process, the number of articles was reduced to 288. The articles were then read in their entirety, and an expert panel of six people decided whether they should be used. This led to a final number of 255 articles. The years in which the articles were published and the exact number can be seen in Fig.  3 .

figure 3

Distribution of studies

Data collection process and synthesis of the results

For the data collection process, various data were extracted from the studies, including the names of the respective creators, the name of the dataset or database and the corresponding reference. It was also determined where the data came from. In the context of accessibility, it was determined whether access is free, controlled, available for purchase or not available. It was also determined when the datasets were created and the time period referenced. The application type and domain characteristics of the datasets were identified.

This section analyses the results of the systematic literature review. The previously identified studies are divided into three categories: datasets on the causes of cyber risks, datasets on the effects of cyber risks and datasets on cybersecurity. The classification is based on the intended use of the studies. This system of classification makes it easier for stakeholders to find the appropriate datasets. The categories are evaluated individually. Although complete information is available for a large proportion of datasets, this is not true for all of them. Accordingly, the abbreviation N/A has been inserted in the respective characters to indicate that this information could not be determined by the time of submission. The term ‘use cases in the literature’ in the following and supplementary tables refers to the application areas in which the corresponding datasets were used in the literature. The areas listed there refer to the topic area on which the researchers conducted their research. Since some datasets were used interdisciplinarily, the listed use cases in the literature are correspondingly longer. Before discussing each category in the next sections, Fig.  4 provides an overview of the number of datasets found and their year of creation. Figure  5 then shows the relationship between studies and datasets in the period under consideration. Figure  6 shows the distribution of studies, their use of datasets and their creation date. The number of datasets used is higher than the number of studies because the studies often used several datasets (Table 1 ).

figure 4

Distribution of dataset results

figure 5

Correlation between the studies and the datasets

figure 6

Distribution of studies and their use of datasets

Most of the datasets are generated in the U.S. (up to 58.2%). Canada and Australia rank next, with 11.3% and 5% of all the reviewed datasets, respectively.

Additionally, to create value for the datasets for the cyber insurance industry, an assessment of the applicability of each dataset has been provided for cyber insurers. This ‘Use Case Assessment’ includes the use of the data in the context of different analyses, calculation of cyber insurance premiums, and use of the information for the design of cyber insurance contracts or for additional customer services. To reasonably account for the transition of direct hyperlinks in the future, references were directed to the main websites for longevity (nearest resource point). In addition, the links to the main pages contain further information on the datasets and different versions related to the operating systems. The references were chosen in such a way that practitioners get the best overview of the respective datasets.

Case datasets

This section presents selected articles that use the datasets to analyse the causes of cyber risks. The datasets help identify emerging trends and allow pattern discovery in cyber risks. This information gives cybersecurity experts and cyber insurers the data to make better predictions and take appropriate action. For example, if certain vulnerabilities are not adequately protected, cyber insurers will demand a risk surcharge leading to an improvement in the risk-adjusted premium. Due to the capricious nature of cyber risks, existing data must be supplemented with new data sources (for example, new events, new methods or security vulnerabilities) to determine prevailing cyber exposure. The datasets of cyber risk causes could be combined with existing portfolio data from cyber insurers and integrated into existing pricing tools and factors to improve the valuation of cyber risks.

A portion of these datasets consists of several taxonomies and classifications of cyber risks. Aassal et al. ( 2020 ) propose a new taxonomy of phishing characteristics based on the interpretation and purpose of each characteristic. In comparison, Hindy et al. ( 2020 ) presented a taxonomy of network threats and the impact of current datasets on intrusion detection systems. A similar taxonomy was suggested by Kiwia et al. ( 2018 ). The authors presented a cyber kill chain-based taxonomy of banking Trojans features. The taxonomy built on a real-world dataset of 127 banking Trojans collected from December 2014 to January 2016 by a major U.K.-based financial organisation.

In the context of classification, Aamir et al. ( 2021 ) showed the benefits of machine learning for classifying port scans and DDoS attacks in a mixture of normal and attack traffic. Guo et al. ( 2020 ) presented a new method to improve malware classification based on entropy sequence features. The evaluation of this new method was conducted on different malware datasets.

To reconstruct attack scenarios and draw conclusions based on the evidence in the alert stream, Barzegar and Shajari ( 2018 ) use the DARPA2000 and MACCDC 2012 dataset for their research. Giudici and Raffinetti ( 2020 ) proposed a rank-based statistical model aimed at predicting the severity levels of cyber risk. The model used cyber risk data from the University of Milan. In contrast to the previous datasets, Skrjanc et al. ( 2018 ) used the older dataset KDD99 to monitor large-scale cyberattacks using a cauchy clustering method.

Amin et al. ( 2021 ) used a cyberattack dataset from the Canadian Institute for Cybersecurity to identify spatial clusters of countries with high rates of cyberattacks. In the context of cybercrime, Junger et al. ( 2020 ) examined crime scripts, key characteristics of the target company and the relationship between criminal effort and financial benefit. For their study, the authors analysed 300 cases of fraudulent activities against Dutch companies. With a similar focus on cybercrime, Mireles et al. ( 2019 ) proposed a metric framework to measure the effectiveness of the dynamic evolution of cyberattacks and defensive measures. To validate its usefulness, they used the DEFCON dataset.

Due to the rapidly changing nature of cyber risks, it is often impossible to obtain all information on them. Kim and Kim ( 2019 ) proposed an automated dataset generation system called CTIMiner that collects threat data from publicly available security reports and malware repositories. They released a dataset to the public containing about 640,000 records from 612 security reports published between January 2008 and 2019. A similar approach is proposed by Kim et al. ( 2020 ), using a named entity recognition system to extract core information from cyber threat reports automatically. They created a 498,000-tag dataset during their research (Ulven and Wangen 2021 ).

Within the framework of vulnerabilities and cybersecurity issues, Ulven and Wangen ( 2021 ) proposed an overview of mission-critical assets and everyday threat events, suggested a generic threat model, and summarised common cybersecurity vulnerabilities. With a focus on hospitality, Chen and Fiscus ( 2018 ) proposed several issues related to cybersecurity in this sector. They analysed 76 security incidents from the Privacy Rights Clearinghouse database. Supplementary Table 1 lists all findings that belong to the cyber causes dataset.

Impact datasets

This section outlines selected findings of the cyber impact dataset. For cyber insurers, these datasets can form an important basis for information, as they can be used to calculate cyber insurance premiums, evaluate specific cyber risks, formulate inclusions and exclusions in cyber wordings, and re-evaluate as well as supplement the data collected so far on cyber risks. For example, information on financial losses can help to better assess the loss potential of cyber risks. Furthermore, the datasets can provide insight into the frequency of occurrence of these cyber risks. The new datasets can be used to close any data gaps that were previously based on very approximate estimates or to find new results.

Eight studies addressed the costs of data breaches. For instance, Eling and Jung ( 2018 ) reviewed 3327 data breach events from 2005 to 2016 and identified an asymmetric dependence of monthly losses by breach type and industry. The authors used datasets from the Privacy Rights Clearinghouse for analysis. The Privacy Rights Clearinghouse datasets and the Breach level index database were also used by De Giovanni et al. ( 2020 ) to describe relationships between data breaches and bitcoin-related variables using the cointegration methodology. The data were obtained from the Department of Health and Human Services of healthcare facilities reporting data breaches and a national database of technical and organisational infrastructure information. Also in the context of data breaches, Algarni et al. ( 2021 ) developed a comprehensive, formal model that estimates the two components of security risks: breach cost and the likelihood of a data breach within 12 months. For their survey, the authors used two industrial reports from the Ponemon institute and VERIZON. To illustrate the scope of data breaches, Neto et al. ( 2021 ) identified 430 major data breach incidents among more than 10,000 incidents. The database created is available and covers the period 2018 to 2019.

With a direct focus on insurance, Biener et al. ( 2015 ) analysed 994 cyber loss cases from an operational risk database and investigated the insurability of cyber risks based on predefined criteria. For their study, they used data from the company SAS OpRisk Global Data. Similarly, Eling and Wirfs ( 2019 ) looked at a wide range of cyber risk events and actual cost data using the same database. They identified cyber losses and analysed them using methods from statistics and actuarial science. Using a similar reference, Farkas et al. ( 2021 ) proposed a method for analysing cyber claims based on regression trees to identify criteria for classifying and evaluating claims. Similar to Chen and Fiscus ( 2018 ), the dataset used was the Privacy Rights Clearinghouse database. Within the framework of reinsurance, Moro ( 2020 ) analysed cyber index-based information technology activity to see if index-parametric reinsurance coverage could suggest its cedant using data from a Symantec dataset.

Paté-Cornell et al. ( 2018 ) presented a general probabilistic risk analysis framework for cybersecurity in an organisation to be specified. The results are distributions of losses to cyberattacks, with and without considered countermeasures in support of risk management decisions based both on past data and anticipated incidents. The data used were from The Common Vulnerability and Exposures database and via confidential access to a database of cyberattacks on a large, U.S.-based organisation. A different conceptual framework for cyber risk classification and assessment was proposed by Sheehan et al. ( 2021 ). This framework showed the importance of proactive and reactive barriers in reducing companies’ exposure to cyber risk and quantifying the risk. Another approach to cyber risk assessment and mitigation was proposed by Mukhopadhyay et al. ( 2019 ). They estimated the probability of an attack using generalised linear models, predicted the security technology required to reduce the probability of cyberattacks, and used gamma and exponential distributions to best approximate the average loss data for each malicious attack. They also calculated the expected loss due to cyberattacks, calculated the net premium that would need to be charged by a cyber insurer, and suggested cyber insurance as a strategy to minimise losses. They used the CSI-FBI survey (1997–2010) to conduct their research.

In order to highlight the lack of data on cyber risks, Eling ( 2020 ) conducted a literature review in the areas of cyber risk and cyber insurance. Available information on the frequency, severity, and dependency structure of cyber risks was filtered out. In addition, open questions for future cyber risk research were set up. Another example of data collection on the impact of cyberattacks is provided by Sornette et al. ( 2013 ), who use a database of newspaper articles, press reports and other media to provide a predictive method to identify triggering events and potential accident scenarios and estimate their severity and frequency. A similar approach to data collection was used by Arcuri et al. ( 2020 ) to gather an original sample of global cyberattacks from newspaper reports sourced from the LexisNexis database. This collection is also used and applied to the fields of dynamic communication and cyber risk perception by Fang et al. ( 2021 ). To create a dataset of cyber incidents and disputes, Valeriano and Maness ( 2014 ) collected information on cyber interactions between rival states.

To assess trends and the scale of economic cybercrime, Levi ( 2017 ) examined datasets from different countries and their impact on crime policy. Pooser et al. ( 2018 ) investigated the trend in cyber risk identification from 2006 to 2015 and company characteristics related to cyber risk perception. The authors used a dataset of various reports from cyber insurers for their study. Walker-Roberts et al. ( 2020 ) investigated the spectrum of risk of a cybersecurity incident taking place in the cyber-physical-enabled world using the VERIS Community Database. The datasets of impacts identified are presented below. Due to overlap, some may also appear in the causes dataset (Supplementary Table 2).

Cybersecurity datasets

General intrusion detection.

General intrusion detection systems account for the largest share of countermeasure datasets. For companies or researchers focused on cybersecurity, the datasets can be used to test their own countermeasures or obtain information about potential vulnerabilities. For example, Al-Omari et al. ( 2021 ) proposed an intelligent intrusion detection model for predicting and detecting attacks in cyberspace, which was applied to dataset UNSW-NB 15. A similar approach was taken by Choras and Kozik ( 2015 ), who used machine learning to detect cyberattacks on web applications. To evaluate their method, they used the HTTP dataset CSIC 2010. For the identification of unknown attacks on web servers, Kamarudin et al. ( 2017 ) proposed an anomaly-based intrusion detection system using an ensemble classification approach. Ganeshan and Rodrigues ( 2020 ) showed an intrusion detection system approach, which clusters the database into several groups and detects the presence of intrusion in the clusters. In comparison, AlKadi et al. ( 2019 ) used a localisation-based model to discover abnormal patterns in network traffic. Hybrid models have been recommended by Bhattacharya et al. ( 2020 ) and Agrawal et al. ( 2019 ); the former is a machine-learning model based on principal component analysis for the classification of intrusion detection system datasets, while the latter is a hybrid ensemble intrusion detection system for anomaly detection using different datasets to detect patterns in network traffic that deviate from normal behaviour.

Agarwal et al. ( 2021 ) used three different machine learning algorithms in their research to find the most suitable for efficiently identifying patterns of suspicious network activity. The UNSW-NB15 dataset was used for this purpose. Kasongo and Sun ( 2020 ), Feed-Forward Deep Neural Network (FFDNN), Keshk et al. ( 2021 ), the privacy-preserving anomaly detection framework, and others also use the UNSW-NB 15 dataset as part of intrusion detection systems. The same dataset and others were used by Binbusayyis and Vaiyapuri ( 2019 ) to identify and compare key features for cyber intrusion detection. Atefinia and Ahmadi ( 2021 ) proposed a deep neural network model to reduce the false positive rate of an anomaly-based intrusion detection system. Fossaceca et al. ( 2015 ) focused in their research on the development of a framework that combined the outputs of multiple learners in order to improve the efficacy of network intrusion, and Gauthama Raman et al. ( 2020 ) presented a search algorithm based on Support Vector machine to improve the performance of the detection and false alarm rate to improve intrusion detection techniques. Ahmad and Alsemmeari ( 2020 ) targeted extreme learning machine techniques due to their good capabilities in classification problems and handling huge data. They used the NSL-KDD dataset as a benchmark.

With reference to prediction, Bakdash et al. ( 2018 ) used datasets from the U.S. Department of Defence to predict cyberattacks by malware. This dataset consists of weekly counts of cyber events over approximately seven years. Another prediction method was presented by Fan et al. ( 2018 ), which showed an improved integrated cybersecurity prediction method based on spatial-time analysis. Also, with reference to prediction, Ashtiani and Azgomi ( 2014 ) proposed a framework for the distributed simulation of cyberattacks based on high-level architecture. Kirubavathi and Anitha ( 2016 ) recommended an approach to detect botnets, irrespective of their structures, based on network traffic flow behaviour analysis and machine-learning techniques. Dwivedi et al. ( 2021 ) introduced a multi-parallel adaptive technique to utilise an adaption mechanism in the group of swarms for network intrusion detection. AlEroud and Karabatis ( 2018 ) presented an approach that used contextual information to automatically identify and query possible semantic links between different types of suspicious activities extracted from network flows.

Intrusion detection systems with a focus on IoT

In addition to general intrusion detection systems, a proportion of studies focused on IoT. Habib et al. ( 2020 ) presented an approach for converting traditional intrusion detection systems into smart intrusion detection systems for IoT networks. To enhance the process of diagnostic detection of possible vulnerabilities with an IoT system, Georgescu et al. ( 2019 ) introduced a method that uses a named entity recognition-based solution. With regard to IoT in the smart home sector, Heartfield et al. ( 2021 ) presented a detection system that is able to autonomously adjust the decision function of its underlying anomaly classification models to a smart home’s changing condition. Another intrusion detection system was suggested by Keserwani et al. ( 2021 ), which combined Grey Wolf Optimization and Particle Swam Optimization to identify various attacks for IoT networks. They used the KDD Cup 99, NSL-KDD and CICIDS-2017 to evaluate their model. Abu Al-Haija and Zein-Sabatto ( 2020 ) provide a comprehensive development of a new intelligent and autonomous deep-learning-based detection and classification system for cyberattacks in IoT communication networks that leverage the power of convolutional neural networks, abbreviated as IoT-IDCS-CNN (IoT-based Intrusion Detection and Classification System using Convolutional Neural Network). To evaluate the development, the authors used the NSL-KDD dataset. Biswas and Roy ( 2021 ) recommended a model that identifies malicious botnet traffic using novel deep-learning approaches like artificial neural networks gutted recurrent units and long- or short-term memory models. They tested their model with the Bot-IoT dataset.

With a more forensic background, Koroniotis et al. ( 2020 ) submitted a network forensic framework, which described the digital investigation phases for identifying and tracing attack behaviours in IoT networks. The suggested work was evaluated with the Bot-IoT and UINSW-NB15 datasets. With a focus on big data and IoT, Chhabra et al. ( 2020 ) presented a cyber forensic framework for big data analytics in an IoT environment using machine learning. Furthermore, the authors mentioned different publicly available datasets for machine-learning models.

A stronger focus on a mobile phones was exhibited by Alazab et al. ( 2020 ), which presented a classification model that combined permission requests and application programme interface calls. The model was tested with a malware dataset containing 27,891 Android apps. A similar approach was taken by Li et al. ( 2019a , b ), who proposed a reliable classifier for Android malware detection based on factorisation machine architecture and extraction of Android app features from manifest files and source code.

Literature reviews

In addition to the different methods and models for intrusion detection systems, various literature reviews on the methods and datasets were also found. Liu and Lang ( 2019 ) proposed a taxonomy of intrusion detection systems that uses data objects as the main dimension to classify and summarise machine learning and deep learning-based intrusion detection literature. They also presented four different benchmark datasets for machine-learning detection systems. Ahmed et al. ( 2016 ) presented an in-depth analysis of four major categories of anomaly detection techniques, which include classification, statistical, information theory and clustering. Hajj et al. ( 2021 ) gave a comprehensive overview of anomaly-based intrusion detection systems. Their article gives an overview of the requirements, methods, measurements and datasets that are used in an intrusion detection system.

Within the framework of machine learning, Chattopadhyay et al. ( 2018 ) conducted a comprehensive review and meta-analysis on the application of machine-learning techniques in intrusion detection systems. They also compared different machine learning techniques in different datasets and summarised the performance. Vidros et al. ( 2017 ) presented an overview of characteristics and methods in automatic detection of online recruitment fraud. They also published an available dataset of 17,880 annotated job ads, retrieved from the use of a real-life system. An empirical study of different unsupervised learning algorithms used in the detection of unknown attacks was presented by Meira et al. ( 2020 ).

New datasets

Kilincer et al. ( 2021 ) reviewed different intrusion detection system datasets in detail. They had a closer look at the UNS-NB15, ISCX-2012, NSL-KDD and CIDDS-001 datasets. Stojanovic et al. ( 2020 ) also provided a review on datasets and their creation for use in advanced persistent threat detection in the literature. Another review of datasets was provided by Sarker et al. ( 2020 ), who focused on cybersecurity data science as part of their research and provided an overview from a machine-learning perspective. Avila et al. ( 2021 ) conducted a systematic literature review on the use of security logs for data leak detection. They recommended a new classification of information leak, which uses the GDPR principles, identified the most widely publicly available dataset for threat detection, described the attack types in the datasets and the algorithms used for data leak detection. Tuncer et al. ( 2020 ) presented a bytecode-based detection method consisting of feature extraction using local neighbourhood binary patterns. They chose a byte-based malware dataset to investigate the performance of the proposed local neighbourhood binary pattern-based detection method. With a different focus, Mauro et al. ( 2020 ) gave an experimental overview of neural-based techniques relevant to intrusion detection. They assessed the value of neural networks using the Bot-IoT and UNSW-DB15 datasets.

Another category of results in the context of countermeasure datasets is those that were presented as new. Moreno et al. ( 2018 ) developed a database of 300 security-related accidents from European and American sources. The database contained cybersecurity-related events in the chemical and process industry. Damasevicius et al. ( 2020 ) proposed a new dataset (LITNET-2020) for network intrusion detection. The dataset is a new annotated network benchmark dataset obtained from the real-world academic network. It presents real-world examples of normal and under-attack network traffic. With a focus on IoT intrusion detection systems, Alsaedi et al. ( 2020 ) proposed a new benchmark IoT/IIot datasets for assessing intrusion detection system-enabled IoT systems. Also in the context of IoT, Vaccari et al. ( 2020 ) proposed a dataset focusing on message queue telemetry transport protocols, which can be used to train machine-learning models. To evaluate the performance of machine-learning classifiers, Mahfouz et al. ( 2020 ) created a dataset called Game Theory and Cybersecurity (GTCS). A dataset containing 22,000 malware and benign samples was constructed by Martin et al. ( 2019 ). The dataset can be used as a benchmark to test the algorithm for Android malware classification and clustering techniques. In addition, Laso et al. ( 2017 ) presented a dataset created to investigate how data and information quality estimates enable the detection of anomalies and malicious acts in cyber-physical systems. The dataset contained various cyberattacks and is publicly available.

In addition to the results described above, several other studies were found that fit into the category of countermeasures. Johnson et al. ( 2016 ) examined the time between vulnerability disclosures. Using another vulnerabilities database, Common Vulnerabilities and Exposures (CVE), Subroto and Apriyana ( 2019 ) presented an algorithm model that uses big data analysis of social media and statistical machine learning to predict cyber risks. A similar databank but with a different focus, Common Vulnerability Scoring System, was used by Chatterjee and Thekdi ( 2020 ) to present an iterative data-driven learning approach to vulnerability assessment and management for complex systems. Using the CICIDS2017 dataset to evaluate the performance, Malik et al. ( 2020 ) proposed a control plane-based orchestration for varied, sophisticated threats and attacks. The same dataset was used in another study by Lee et al. ( 2019 ), who developed an artificial security information event management system based on a combination of event profiling for data processing and different artificial network methods. To exploit the interdependence between multiple series, Fang et al. ( 2021 ) proposed a statistical framework. In order to validate the framework, the authors applied it to a dataset of enterprise-level security breaches from the Privacy Rights Clearinghouse and Identity Theft Center database. Another framework with a defensive aspect was recommended by Li et al. ( 2021 ) to increase the robustness of deep neural networks against adversarial malware evasion attacks. Sarabi et al. ( 2016 ) investigated whether and to what extent business details can help assess an organisation's risk of data breaches and the distribution of risk across different types of incidents to create policies for protection, detection and recovery from different forms of security incidents. They used data from the VERIS Community Database.

Datasets that have been classified into the cybersecurity category are detailed in Supplementary Table 3. Due to overlap, records from the previous tables may also be included.

This paper presented a systematic literature review of studies on cyber risk and cybersecurity that used datasets. Within this framework, 255 studies were fully reviewed and then classified into three different categories. Then, 79 datasets were consolidated from these studies. These datasets were subsequently analysed, and important information was selected through a process of filtering out. This information was recorded in a table and enhanced with further information as part of the literature analysis. This made it possible to create a comprehensive overview of the datasets. For example, each dataset contains a description of where the data came from and how the data has been used to date. This allows different datasets to be compared and the appropriate dataset for the use case to be selected. This research certainly has limitations, so our selection of datasets cannot necessarily be taken as a representation of all available datasets related to cyber risks and cybersecurity. For example, literature searches were conducted in four academic databases and only found datasets that were used in the literature. Many research projects also used old datasets that may no longer consider current developments. In addition, the data are often focused on only one observation and are limited in scope. For example, the datasets can only be applied to specific contexts and are also subject to further limitations (e.g. region, industry, operating system). In the context of the applicability of the datasets, it is unfortunately not possible to make a clear statement on the extent to which they can be integrated into academic or practical areas of application or how great this effort is. Finally, it remains to be pointed out that this is an overview of currently available datasets, which are subject to constant change.

Due to the lack of datasets on cyber risks in the academic literature, additional datasets on cyber risks were integrated as part of a further search. The search was conducted on the Google Dataset search portal. The search term used was ‘cyber risk datasets’. Over 100 results were found. However, due to the low significance and verifiability, only 20 selected datasets were included. These can be found in Table 2  in the “ Appendix ”.

The results of the literature review and datasets also showed that there continues to be a lack of available, open cyber datasets. This lack of data is reflected in cyber insurance, for example, as it is difficult to find a risk-based premium without a sufficient database (Nurse et al. 2020 ). The global cyber insurance market was estimated at USD 5.5 billion in 2020 (Dyson 2020 ). When compared to the USD 1 trillion global losses from cybercrime (Maleks Smith et al. 2020 ), it is clear that there exists a significant cyber risk awareness challenge for both the insurance industry and international commerce. Without comprehensive and qualitative data on cyber losses, it can be difficult to estimate potential losses from cyberattacks and price cyber insurance accordingly (GAO 2021 ). For instance, the average cyber insurance loss increased from USD 145,000 in 2019 to USD 359,000 in 2020 (FitchRatings 2021 ). Cyber insurance is an important risk management tool to mitigate the financial impact of cybercrime. This is particularly evident in the impact of different industries. In the Energy & Commodities financial markets, a ransomware attack on the Colonial Pipeline led to a substantial impact on the U.S. economy. As a result of the attack, about 45% of the U.S. East Coast was temporarily unable to obtain supplies of diesel, petrol and jet fuel. This caused the average price in the U.S. to rise 7 cents to USD 3.04 per gallon, the highest in seven years (Garber 2021 ). In addition, Colonial Pipeline confirmed that it paid a USD 4.4 million ransom to a hacker gang after the attack. Another ransomware attack occurred in the healthcare and government sector. The victim of this attack was the Irish Health Service Executive (HSE). A ransom payment of USD 20 million was demanded from the Irish government to restore services after the hack (Tidy 2021 ). In the car manufacturing sector, Miller and Valasek ( 2015 ) initiated a cyberattack that resulted in the recall of 1.4 million vehicles and cost manufacturers EUR 761 million. The risk that arises in the context of these events is the potential for the accumulation of cyber losses, which is why cyber insurers are not expanding their capacity. An example of this accumulation of cyber risks is the NotPetya malware attack, which originated in Russia, struck in Ukraine, and rapidly spread around the world, causing at least USD 10 billion in damage (GAO 2021 ). These events highlight the importance of proper cyber risk management.

This research provides cyber insurance stakeholders with an overview of cyber datasets. Cyber insurers can use the open datasets to improve their understanding and assessment of cyber risks. For example, the impact datasets can be used to better measure financial impacts and their frequencies. These data could be combined with existing portfolio data from cyber insurers and integrated with existing pricing tools and factors to better assess cyber risk valuation. Although most cyber insurers have sparse historical cyber policy and claims data, they remain too small at present for accurate prediction (Bessy-Roland et al. 2021 ). A combination of portfolio data and external datasets would support risk-adjusted pricing for cyber insurance, which would also benefit policyholders. In addition, cyber insurance stakeholders can use the datasets to identify patterns and make better predictions, which would benefit sustainable cyber insurance coverage. In terms of cyber risk cause datasets, cyber insurers can use the data to review their insurance products. For example, the data could provide information on which cyber risks have not been sufficiently considered in product design or where improvements are needed. A combination of cyber cause and cybersecurity datasets can help establish uniform definitions to provide greater transparency and clarity. Consistent terminology could lead to a more sustainable cyber market, where cyber insurers make informed decisions about the level of coverage and policyholders understand their coverage (The Geneva Association 2020).

In addition to the cyber insurance community, this research also supports cybersecurity stakeholders. The reviewed literature can be used to provide a contemporary, contextual and categorised summary of available datasets. This supports efficient and timely progress in cyber risk research and is beneficial given the dynamic nature of cyber risks. With the help of the described cybersecurity datasets and the identified information, a comparison of different datasets is possible. The datasets can be used to evaluate the effectiveness of countermeasures in simulated cyberattacks or to test intrusion detection systems.

In this paper, we conducted a systematic review of studies on cyber risk and cybersecurity databases. We found that most of the datasets are in the field of intrusion detection and machine learning and are used for technical cybersecurity aspects. The available datasets on cyber risks were relatively less represented. Due to the dynamic nature and lack of historical data, assessing and understanding cyber risk is a major challenge for cyber insurance stakeholders. To address this challenge, a greater density of cyber data is needed to support cyber insurers in risk management and researchers with cyber risk-related topics. With reference to ‘Open Science’ FAIR data (Jacobsen et al. 2020 ), mandatory reporting of cyber incidents could help improve cyber understanding, awareness and loss prevention among companies and insurers. Through greater availability of data, cyber risks can be better understood, enabling researchers to conduct more in-depth research into these risks. Companies could incorporate this new knowledge into their corporate culture to reduce cyber risks. For insurance companies, this would have the advantage that all insurers would have the same understanding of cyber risks, which would support sustainable risk-based pricing. In addition, common definitions of cyber risks could be derived from new data.

The cybersecurity databases summarised and categorised in this research could provide a different perspective on cyber risks that would enable the formulation of common definitions in cyber policies. The datasets can help companies addressing cybersecurity and cyber risk as part of risk management assess their internal cyber posture and cybersecurity measures. The paper can also help improve risk awareness and corporate behaviour, and provides the research community with a comprehensive overview of peer-reviewed datasets and other available datasets in the area of cyber risk and cybersecurity. This approach is intended to support the free availability of data for research. The complete tabulated review of the literature is included in the Supplementary Material.

This work provides directions for several paths of future work. First, there are currently few publicly available datasets for cyber risk and cybersecurity. The older datasets that are still widely used no longer reflect today's technical environment. Moreover, they can often only be used in one context, and the scope of the samples is very limited. It would be of great value if more datasets were publicly available that reflect current environmental conditions. This could help intrusion detection systems to consider current events and thus lead to a higher success rate. It could also compensate for the disadvantages of older datasets by collecting larger quantities of samples and making this contextualisation more widespread. Another area of research may be the integratability and adaptability of cybersecurity and cyber risk datasets. For example, it is often unclear to what extent datasets can be integrated or adapted to existing data. For cyber risks and cybersecurity, it would be helpful to know what requirements need to be met or what is needed to use the datasets appropriately. In addition, it would certainly be helpful to know whether datasets can be modified to be used for cyber risks or cybersecurity. Finally, the ability for stakeholders to identify machine-readable cybersecurity datasets would be useful because it would allow for even clearer delineations or comparisons between datasets. Due to the lack of publicly available datasets, concrete benchmarks often cannot be applied.

Average cost of a breach of more than 50 million records.

Aamir, M., S.S.H. Rizvi, M.A. Hashmani, M. Zubair, and J. Ahmad. 2021. Machine learning classification of port scanning and DDoS attacks: A comparative analysis. Mehran University Research Journal of Engineering and Technology 40 (1): 215–229. https://doi.org/10.22581/muet1982.2101.19 .

Article   Google Scholar  

Aamir, M., and S.M.A. Zaidi. 2019. DDoS attack detection with feature engineering and machine learning: The framework and performance evaluation. International Journal of Information Security 18 (6): 761–785. https://doi.org/10.1007/s10207-019-00434-1 .

Aassal, A. El, S. Baki, A. Das, and R.M. Verma. 2020. 2020. An in-depth benchmarking and evaluation of phishing detection research for security needs. IEEE Access 8: 22170–22192. https://doi.org/10.1109/ACCESS.2020.2969780 .

Abu Al-Haija, Q., and S. Zein-Sabatto. 2020. An efficient deep-learning-based detection and classification system for cyber-attacks in IoT communication networks. Electronics 9 (12): 26. https://doi.org/10.3390/electronics9122152 .

Adhikari, U., T.H. Morris, and S.Y. Pan. 2018. Applying Hoeffding adaptive trees for real-time cyber-power event and intrusion classification. IEEE Transactions on Smart Grid 9 (5): 4049–4060. https://doi.org/10.1109/tsg.2017.2647778 .

Agarwal, A., P. Sharma, M. Alshehri, A.A. Mohamed, and O. Alfarraj. 2021. Classification model for accuracy and intrusion detection using machine learning approach. PeerJ Computer Science . https://doi.org/10.7717/peerj-cs.437 .

Agrafiotis, I., J.R.C.. Nurse, M. Goldsmith, S. Creese, and D. Upton. 2018. A taxonomy of cyber-harms: Defining the impacts of cyber-attacks and understanding how they propagate. Journal of Cybersecurity 4: tyy006.

Agrawal, A., S. Mohammed, and J. Fiaidhi. 2019. Ensemble technique for intruder detection in network traffic. International Journal of Security and Its Applications 13 (3): 1–8. https://doi.org/10.33832/ijsia.2019.13.3.01 .

Ahmad, I., and R.A. Alsemmeari. 2020. Towards improving the intrusion detection through ELM (extreme learning machine). CMC Computers Materials & Continua 65 (2): 1097–1111. https://doi.org/10.32604/cmc.2020.011732 .

Ahmed, M., A.N. Mahmood, and J.K. Hu. 2016. A survey of network anomaly detection techniques. Journal of Network and Computer Applications 60: 19–31. https://doi.org/10.1016/j.jnca.2015.11.016 .

Al-Jarrah, O.Y., O. Alhussein, P.D. Yoo, S. Muhaidat, K. Taha, and K. Kim. 2016. Data randomization and cluster-based partitioning for Botnet intrusion detection. IEEE Transactions on Cybernetics 46 (8): 1796–1806. https://doi.org/10.1109/TCYB.2015.2490802 .

Al-Mhiqani, M.N., R. Ahmad, Z.Z. Abidin, W. Yassin, A. Hassan, K.H. Abdulkareem, N.S. Ali, and Z. Yunos. 2020. A review of insider threat detection: Classification, machine learning techniques, datasets, open challenges, and recommendations. Applied Sciences—Basel 10 (15): 41. https://doi.org/10.3390/app10155208 .

Al-Omari, M., M. Rawashdeh, F. Qutaishat, M. Alshira’H, and N. Ababneh. 2021. An intelligent tree-based intrusion detection model for cyber security. Journal of Network and Systems Management 29 (2): 18. https://doi.org/10.1007/s10922-021-09591-y .

Alabdallah, A., and M. Awad. 2018. Using weighted Support Vector Machine to address the imbalanced classes problem of Intrusion Detection System. KSII Transactions on Internet and Information Systems 12 (10): 5143–5158. https://doi.org/10.3837/tiis.2018.10.027 .

Alazab, M., M. Alazab, A. Shalaginov, A. Mesleh, and A. Awajan. 2020. Intelligent mobile malware detection using permission requests and API calls. Future Generation Computer Systems—the International Journal of eScience 107: 509–521. https://doi.org/10.1016/j.future.2020.02.002 .

Albahar, M.A., R.A. Al-Falluji, and M. Binsawad. 2020. An empirical comparison on malicious activity detection using different neural network-based models. IEEE Access 8: 61549–61564. https://doi.org/10.1109/ACCESS.2020.2984157 .

AlEroud, A.F., and G. Karabatis. 2018. Queryable semantics to detect cyber-attacks: A flow-based detection approach. IEEE Transactions on Systems, Man, and Cybernetics: Systems 48 (2): 207–223. https://doi.org/10.1109/TSMC.2016.2600405 .

Algarni, A.M., V. Thayananthan, and Y.K. Malaiya. 2021. Quantitative assessment of cybersecurity risks for mitigating data breaches in business systems. Applied Sciences (switzerland) . https://doi.org/10.3390/app11083678 .

Alhowaide, A., I. Alsmadi, and J. Tang. 2021. Towards the design of real-time autonomous IoT NIDS. Cluster Computing—the Journal of Networks Software Tools and Applications . https://doi.org/10.1007/s10586-021-03231-5 .

Ali, S., and Y. Li. 2019. Learning multilevel auto-encoders for DDoS attack detection in smart grid network. IEEE Access 7: 108647–108659. https://doi.org/10.1109/ACCESS.2019.2933304 .

AlKadi, O., N. Moustafa, B. Turnbull, and K.K.R. Choo. 2019. Mixture localization-based outliers models for securing data migration in cloud centers. IEEE Access 7: 114607–114618. https://doi.org/10.1109/ACCESS.2019.2935142 .

Allianz. 2021. Allianz Risk Barometer. https://www.agcs.allianz.com/content/dam/onemarketing/agcs/agcs/reports/Allianz-Risk-Barometer-2021.pdf . Accessed 15 May 2021.

Almiani, M., A. AbuGhazleh, A. Al-Rahayfeh, S. Atiewi, and Razaque, A. 2020. Deep recurrent neural network for IoT intrusion detection system. Simulation Modelling Practice and Theory 101: 102031. https://doi.org/10.1016/j.simpat.2019.102031

Alsaedi, A., N. Moustafa, Z. Tari, A. Mahmood, and A. Anwar. 2020. TON_IoT telemetry dataset: A new generation dataset of IoT and IIoT for data-driven intrusion detection systems. IEEE Access 8: 165130–165150. https://doi.org/10.1109/access.2020.3022862 .

Alsamiri, J., and K. Alsubhi. 2019. Internet of Things cyber attacks detection using machine learning. International Journal of Advanced Computer Science and Applications 10 (12): 627–634.

Alsharafat, W. 2013. Applying artificial neural network and eXtended classifier system for network intrusion detection. International Arab Journal of Information Technology 10 (3): 230–238.

Google Scholar  

Amin, R.W., H.E. Sevil, S. Kocak, G. Francia III., and P. Hoover. 2021. The spatial analysis of the malicious uniform resource locators (URLs): 2016 dataset case study. Information (switzerland) 12 (1): 1–18. https://doi.org/10.3390/info12010002 .

Arcuri, M.C., L.Z. Gai, F. Ielasi, and E. Ventisette. 2020. Cyber attacks on hospitality sector: Stock market reaction. Journal of Hospitality and Tourism Technology 11 (2): 277–290. https://doi.org/10.1108/jhtt-05-2019-0080 .

Arp, D., M. Spreitzenbarth, M. Hubner, H. Gascon, K. Rieck, and C.E.R.T. Siemens. 2014. Drebin: Effective and explainable detection of android malware in your pocket. In Ndss 14: 23–26.

Ashtiani, M., and M.A. Azgomi. 2014. A distributed simulation framework for modeling cyber attacks and the evaluation of security measures. Simulation 90 (9): 1071–1102. https://doi.org/10.1177/0037549714540221 .

Atefinia, R., and M. Ahmadi. 2021. Network intrusion detection using multi-architectural modular deep neural network. Journal of Supercomputing 77 (4): 3571–3593. https://doi.org/10.1007/s11227-020-03410-y .

Avila, R., R. Khoury, R. Khoury, and F. Petrillo. 2021. Use of security logs for data leak detection: A systematic literature review. Security and Communication Networks 2021: 29. https://doi.org/10.1155/2021/6615899 .

Azeez, N.A., T.J. Ayemobola, S. Misra, R. Maskeliunas, and R. Damasevicius. 2019. Network Intrusion Detection with a Hashing Based Apriori Algorithm Using Hadoop MapReduce. Computers 8 (4): 15. https://doi.org/10.3390/computers8040086 .

Bakdash, J.Z., S. Hutchinson, E.G. Zaroukian, L.R. Marusich, S. Thirumuruganathan, C. Sample, B. Hoffman, and G. Das. 2018. Malware in the future forecasting of analyst detection of cyber events. Journal of Cybersecurity . https://doi.org/10.1093/cybsec/tyy007 .

Barletta, V.S., D. Caivano, A. Nannavecchia, and M. Scalera. 2020. Intrusion detection for in-vehicle communication networks: An unsupervised Kohonen SOM approach. Future Internet . https://doi.org/10.3390/FI12070119 .

Barzegar, M., and M. Shajari. 2018. Attack scenario reconstruction using intrusion semantics. Expert Systems with Applications 108: 119–133. https://doi.org/10.1016/j.eswa.2018.04.030 .

Bessy-Roland, Y., A. Boumezoued, and C. Hillairet. 2021. Multivariate Hawkes process for cyber insurance. Annals of Actuarial Science 15 (1): 14–39.

Bhardwaj, A., V. Mangat, and R. Vig. 2020. Hyperband tuned deep neural network with well posed stacked sparse AutoEncoder for detection of DDoS attacks in cloud. IEEE Access 8: 181916–181929. https://doi.org/10.1109/ACCESS.2020.3028690 .

Bhati, B.S., C.S. Rai, B. Balamurugan, and F. Al-Turjman. 2020. An intrusion detection scheme based on the ensemble of discriminant classifiers. Computers & Electrical Engineering 86: 9. https://doi.org/10.1016/j.compeleceng.2020.106742 .

Bhattacharya, S., S.S.R. Krishnan, P.K.R. Maddikunta, R. Kaluri, S. Singh, T.R. Gadekallu, M. Alazab, and U. Tariq. 2020. A novel PCA-firefly based XGBoost classification model for intrusion detection in networks using GPU. Electronics 9 (2): 16. https://doi.org/10.3390/electronics9020219 .

Bibi, I., A. Akhunzada, J. Malik, J. Iqbal, A. Musaddiq, and S. Kim. 2020. A dynamic DL-driven architecture to combat sophisticated android malware. IEEE Access 8: 129600–129612. https://doi.org/10.1109/ACCESS.2020.3009819 .

Biener, C., M. Eling, and J.H. Wirfs. 2015. Insurability of cyber risk: An empirical analysis. The   Geneva Papers on Risk and Insurance—Issues and Practice 40 (1): 131–158. https://doi.org/10.1057/gpp.2014.19 .

Binbusayyis, A., and T. Vaiyapuri. 2019. Identifying and benchmarking key features for cyber intrusion detection: An ensemble approach. IEEE Access 7: 106495–106513. https://doi.org/10.1109/ACCESS.2019.2929487 .

Biswas, R., and S. Roy. 2021. Botnet traffic identification using neural networks. Multimedia Tools and Applications . https://doi.org/10.1007/s11042-021-10765-8 .

Bouyeddou, B., F. Harrou, B. Kadri, and Y. Sun. 2021. Detecting network cyber-attacks using an integrated statistical approach. Cluster Computing—the Journal of Networks Software Tools and Applications 24 (2): 1435–1453. https://doi.org/10.1007/s10586-020-03203-1 .

Bozkir, A.S., and M. Aydos. 2020. LogoSENSE: A companion HOG based logo detection scheme for phishing web page and E-mail brand recognition. Computers & Security 95: 18. https://doi.org/10.1016/j.cose.2020.101855 .

Brower, D., and M. McCormick. 2021. Colonial pipeline resumes operations following ransomware attack. Financial Times .

Cai, H., F. Zhang, and A. Levi. 2019. An unsupervised method for detecting shilling attacks in recommender systems by mining item relationship and identifying target items. The Computer Journal 62 (4): 579–597. https://doi.org/10.1093/comjnl/bxy124 .

Cebula, J.J., M.E. Popeck, and L.R. Young. 2014. A Taxonomy of Operational Cyber Security Risks Version 2 .

Chadza, T., K.G. Kyriakopoulos, and S. Lambotharan. 2020. Learning to learn sequential network attacks using hidden Markov models. IEEE Access 8: 134480–134497. https://doi.org/10.1109/ACCESS.2020.3011293 .

Chatterjee, S., and S. Thekdi. 2020. An iterative learning and inference approach to managing dynamic cyber vulnerabilities of complex systems. Reliability Engineering and System Safety . https://doi.org/10.1016/j.ress.2019.106664 .

Chattopadhyay, M., R. Sen, and S. Gupta. 2018. A comprehensive review and meta-analysis on applications of machine learning techniques in intrusion detection. Australasian Journal of Information Systems 22: 27.

Chen, H.S., and J. Fiscus. 2018. The inhospitable vulnerability: A need for cybersecurity risk assessment in the hospitality industry. Journal of Hospitality and Tourism Technology 9 (2): 223–234. https://doi.org/10.1108/JHTT-07-2017-0044 .

Chhabra, G.S., V.P. Singh, and M. Singh. 2020. Cyber forensics framework for big data analytics in IoT environment using machine learning. Multimedia Tools and Applications 79 (23–24): 15881–15900. https://doi.org/10.1007/s11042-018-6338-1 .

Chiba, Z., N. Abghour, K. Moussaid, A. Elomri, and M. Rida. 2019. Intelligent approach to build a Deep Neural Network based IDS for cloud environment using combination of machine learning algorithms. Computers and Security 86: 291–317. https://doi.org/10.1016/j.cose.2019.06.013 .

Choras, M., and R. Kozik. 2015. Machine learning techniques applied to detect cyber attacks on web applications. Logic Journal of the IGPL 23 (1): 45–56. https://doi.org/10.1093/jigpal/jzu038 .

Chowdhury, S., M. Khanzadeh, R. Akula, F. Zhang, S. Zhang, H. Medal, M. Marufuzzaman, and L. Bian. 2017. Botnet detection using graph-based feature clustering. Journal of Big Data 4 (1): 14. https://doi.org/10.1186/s40537-017-0074-7 .

Cost Of A Cyber Incident: Systematic Review And Cross-Validation, Cybersecurity & Infrastructure Agency , 1, https://www.cisa.gov/sites/default/files/publications/CISA-OCE_Cost_of_Cyber_Incidents_Study-FINAL_508.pdf (2020).

D’Hooge, L., T. Wauters, B. Volckaert, and F. De Turck. 2019. Classification hardness for supervised learners on 20 years of intrusion detection data. IEEE Access 7: 167455–167469. https://doi.org/10.1109/access.2019.2953451 .

Damasevicius, R., A. Venckauskas, S. Grigaliunas, J. Toldinas, N. Morkevicius, T. Aleliunas, and P. Smuikys. 2020. LITNET-2020: An annotated real-world network flow dataset for network intrusion detection. Electronics 9 (5): 23. https://doi.org/10.3390/electronics9050800 .

De Giovanni, A.L.D., and M. Pirra. 2020. On the determinants of data breaches: A cointegration analysis. Decisions in Economics and Finance . https://doi.org/10.1007/s10203-020-00301-y .

Deng, L., D. Li, X. Yao, and H. Wang. 2019. Retracted Article: Mobile network intrusion detection for IoT system based on transfer learning algorithm. Cluster Computing 22 (4): 9889–9904. https://doi.org/10.1007/s10586-018-1847-2 .

Donkal, G., and G.K. Verma. 2018. A multimodal fusion based framework to reinforce IDS for securing Big Data environment using Spark. Journal of Information Security and Applications 43: 1–11. https://doi.org/10.1016/j.jisa.2018.10.001 .

Dunn, C., N. Moustafa, and B. Turnbull. 2020. Robustness evaluations of sustainable machine learning models against data Poisoning attacks in the Internet of Things. Sustainability 12 (16): 17. https://doi.org/10.3390/su12166434 .

Dwivedi, S., M. Vardhan, and S. Tripathi. 2021. Multi-parallel adaptive grasshopper optimization technique for detecting anonymous attacks in wireless networks. Wireless Personal Communications . https://doi.org/10.1007/s11277-021-08368-5 .

Dyson, B. 2020. COVID-19 crisis could be ‘watershed’ for cyber insurance, says Swiss Re exec. https://www.spglobal.com/marketintelligence/en/news-insights/latest-news-headlines/covid-19-crisis-could-be-watershed-for-cyber-insurance-says-swiss-re-exec-59197154 . Accessed 7 May 2020.

EIOPA. 2018. Understanding cyber insurance—a structured dialogue with insurance companies. https://www.eiopa.europa.eu/sites/default/files/publications/reports/eiopa_understanding_cyber_insurance.pdf . Accessed 28 May 2018

Elijah, A.V., A. Abdullah, N.Z. JhanJhi, M. Supramaniam, and O.B. Abdullateef. 2019. Ensemble and deep-learning methods for two-class and multi-attack anomaly intrusion detection: An empirical study. International Journal of Advanced Computer Science and Applications 10 (9): 520–528.

Eling, M., and K. Jung. 2018. Copula approaches for modeling cross-sectional dependence of data breach losses. Insurance Mathematics & Economics 82: 167–180. https://doi.org/10.1016/j.insmatheco.2018.07.003 .

Eling, M., and W. Schnell. 2016. What do we know about cyber risk and cyber risk insurance? Journal of Risk Finance 17 (5): 474–491. https://doi.org/10.1108/jrf-09-2016-0122 .

Eling, M., and J. Wirfs. 2019. What are the actual costs of cyber risk events? European Journal of Operational Research 272 (3): 1109–1119. https://doi.org/10.1016/j.ejor.2018.07.021 .

Eling, M. 2020. Cyber risk research in business and actuarial science. European Actuarial Journal 10 (2): 303–333.

Elmasry, W., A. Akbulut, and A.H. Zaim. 2019. Empirical study on multiclass classification-based network intrusion detection. Computational Intelligence 35 (4): 919–954. https://doi.org/10.1111/coin.12220 .

Elsaid, S.A., and N.S. Albatati. 2020. An optimized collaborative intrusion detection system for wireless sensor networks. Soft Computing 24 (16): 12553–12567. https://doi.org/10.1007/s00500-020-04695-0 .

Estepa, R., J.E. Díaz-Verdejo, A. Estepa, and G. Madinabeitia. 2020. How much training data is enough? A case study for HTTP anomaly-based intrusion detection. IEEE Access 8: 44410–44425. https://doi.org/10.1109/ACCESS.2020.2977591 .

European Council. 2021. Cybersecurity: how the EU tackles cyber threats. https://www.consilium.europa.eu/en/policies/cybersecurity/ . Accessed 10 May 2021

Falco, G. et al. 2019. Cyber risk research impeded by disciplinary barriers. Science (American Association for the Advancement of Science) 366 (6469): 1066–1069.

Fan, Z.J., Z.P. Tan, C.X. Tan, and X. Li. 2018. An improved integrated prediction method of cyber security situation based on spatial-time analysis. Journal of Internet Technology 19 (6): 1789–1800. https://doi.org/10.3966/160792642018111906015 .

Fang, Z.J., M.C. Xu, S.H. Xu, and T.Z. Hu. 2021. A framework for predicting data breach risk: Leveraging dependence to cope with sparsity. IEEE Transactions on Information Forensics and Security 16: 2186–2201. https://doi.org/10.1109/tifs.2021.3051804 .

Farkas, S., O. Lopez, and M. Thomas. 2021. Cyber claim analysis using Generalized Pareto regression trees with applications to insurance. Insurance: Mathematics and Economics 98: 92–105. https://doi.org/10.1016/j.insmatheco.2021.02.009 .

Farsi, H., A. Fanian, and Z. Taghiyarrenani. 2019. A novel online state-based anomaly detection system for process control networks. International Journal of Critical Infrastructure Protection 27: 11. https://doi.org/10.1016/j.ijcip.2019.100323 .

Ferrag, M.A., L. Maglaras, S. Moschoyiannis, and H. Janicke. 2020. Deep learning for cyber security intrusion detection: Approaches, datasets, and comparative study. Journal of Information Security and Applications 50: 19. https://doi.org/10.1016/j.jisa.2019.102419 .

Field, M. 2018. WannaCry cyber attack cost the NHS £92m as 19,000 appointments cancelled. https://www.telegraph.co.uk/technology/2018/10/11/wannacry-cyber-attack-cost-nhs-92m-19000-appointments-cancelled/ . Accessed 9 May 2018.

FitchRatings. 2021. U.S. Cyber Insurance Market Update (Spike in Claims Leads to Decline in 2020 Underwriting Performance). https://www.fitchratings.com/research/insurance/us-cyber-insurance-market-update-spike-in-claims-leads-to-decline-in-2020-underwriting-performance-26-05-2021 .

Fossaceca, J.M., T.A. Mazzuchi, and S. Sarkani. 2015. MARK-ELM: Application of a novel Multiple Kernel Learning framework for improving the robustness of network intrusion detection. Expert Systems with Applications 42 (8): 4062–4080. https://doi.org/10.1016/j.eswa.2014.12.040 .

Franke, U., and J. Brynielsson. 2014. Cyber situational awareness–a systematic review of the literature. Computers & security 46: 18–31.

Freeha, K., K.J. Hwan, M. Lars, and M. Robin. 2021. Data breach management: An integrated risk model. Information & Management 58 (1): 103392. https://doi.org/10.1016/j.im.2020.103392 .

Ganeshan, R., and P. Rodrigues. 2020. Crow-AFL: Crow based adaptive fractional lion optimization approach for the intrusion detection. Wireless Personal Communications 111 (4): 2065–2089. https://doi.org/10.1007/s11277-019-06972-0 .

GAO. 2021. CYBER INSURANCE—Insurers and policyholders face challenges in an evolving market. https://www.gao.gov/assets/gao-21-477.pdf . Accessed 16 May 2021.

Garber, J. 2021. Colonial Pipeline fiasco foreshadows impact of Biden energy policy. https://www.foxbusiness.com/markets/colonial-pipeline-fiasco-foreshadows-impact-of-biden-energy-policy . Accessed 4 May 2021.

Gauthama Raman, M.R., N. Somu, S. Jagarapu, T. Manghnani, T. Selvam, K. Krithivasan, and V.S. Shankar Sriram. 2020. An efficient intrusion detection technique based on support vector machine and improved binary gravitational search algorithm. Artificial Intelligence Review 53 (5): 3255–3286. https://doi.org/10.1007/s10462-019-09762-z .

Gavel, S., A.S. Raghuvanshi, and S. Tiwari. 2021. Distributed intrusion detection scheme using dual-axis dimensionality reduction for Internet of things (IoT). Journal of Supercomputing . https://doi.org/10.1007/s11227-021-03697-5 .

GDPR.EU. 2021. FAQ. https://gdpr.eu/faq/ . Accessed 10 May 2021.

Georgescu, T.M., B. Iancu, and M. Zurini. 2019. Named-entity-recognition-based automated system for diagnosing cybersecurity situations in IoT networks. Sensors (switzerland) . https://doi.org/10.3390/s19153380 .

Giudici, P., and E. Raffinetti. 2020. Cyber risk ordering with rank-based statistical models. AStA Advances in Statistical Analysis . https://doi.org/10.1007/s10182-020-00387-0 .

Goh, J., S. Adepu, K.N. Junejo, and A. Mathur. 2016. A dataset to support research in the design of secure water treatment systems. In CRITIS.

Gong, X.Y., J.L. Lu, Y.F. Zhou, H. Qiu, and R. He. 2021. Model uncertainty based annotation error fixing for web attack detection. Journal of Signal Processing Systems for Signal Image and Video Technology 93 (2–3): 187–199. https://doi.org/10.1007/s11265-019-01494-1 .

Goode, S., H. Hoehle, V. Venkatesh, and S.A. Brown. 2017. USER compensation as a data breach recovery action: An investigation of the sony playstation network breach. MIS Quarterly 41 (3): 703–727.

Guo, H., S. Huang, C. Huang, Z. Pan, M. Zhang, and F. Shi. 2020. File entropy signal analysis combined with wavelet decomposition for malware classification. IEEE Access 8: 158961–158971. https://doi.org/10.1109/ACCESS.2020.3020330 .

Habib, M., I. Aljarah, and H. Faris. 2020. A Modified multi-objective particle swarm optimizer-based Lévy flight: An approach toward intrusion detection in Internet of Things. Arabian Journal for Science and Engineering 45 (8): 6081–6108. https://doi.org/10.1007/s13369-020-04476-9 .

Hajj, S., R. El Sibai, J.B. Abdo, J. Demerjian, A. Makhoul, and C. Guyeux. 2021. Anomaly-based intrusion detection systems: The requirements, methods, measurements, and datasets. Transactions on Emerging Telecommunications Technologies 32 (4): 36. https://doi.org/10.1002/ett.4240 .

Heartfield, R., G. Loukas, A. Bezemskij, and E. Panaousis. 2021. Self-configurable cyber-physical intrusion detection for smart homes using reinforcement learning. IEEE Transactions on Information Forensics and Security 16: 1720–1735. https://doi.org/10.1109/tifs.2020.3042049 .

Hemo, B., T. Gafni, K. Cohen, and Q. Zhao. 2020. Searching for anomalies over composite hypotheses. IEEE Transactions on Signal Processing 68: 1181–1196. https://doi.org/10.1109/TSP.2020.2971438

Hindy, H., D. Brosset, E. Bayne, A.K. Seeam, C. Tachtatzis, R. Atkinson, and X. Bellekens. 2020. A taxonomy of network threats and the effect of current datasets on intrusion detection systems. IEEE Access 8: 104650–104675. https://doi.org/10.1109/ACCESS.2020.3000179 .

Hong, W., D. Huang, C. Chen, and J. Lee. 2020. Towards accurate and efficient classification of power system contingencies and cyber-attacks using recurrent neural networks. IEEE Access 8: 123297–123309. https://doi.org/10.1109/ACCESS.2020.3007609 .

Husák, M., M. Zádník, V. Bartos, and P. Sokol. 2020. Dataset of intrusion detection alerts from a sharing platform. Data in Brief 33: 106530.

IBM Security. 2020. Cost of a Data breach Report. https://www.capita.com/sites/g/files/nginej291/files/2020-08/Ponemon-Global-Cost-of-Data-Breach-Study-2020.pdf . Accessed 19 May 2021.

IEEE. 2021. IEEE Quick Facts. https://www.ieee.org/about/at-a-glance.html . Accessed 11 May 2021.

Kilincer, I.F., F. Ertam, and S. Abdulkadir. 2021. Machine learning methods for cyber security intrusion detection: Datasets and comparative study. Computer Networks 188: 107840. https://doi.org/10.1016/j.comnet.2021.107840 .

Jaber, A.N., and S. Ul Rehman. 2020. FCM-SVM based intrusion detection system for cloud computing environment. Cluster Computing—the Journal of Networks Software Tools and Applications 23 (4): 3221–3231. https://doi.org/10.1007/s10586-020-03082-6 .

Jacobs, J., S. Romanosky, B. Edwards, M. Roytman, and I. Adjerid. 2019. Exploit prediction scoring system (epss). arXiv:1908.04856

Jacobsen, A. et al. 2020. FAIR principles: Interpretations and implementation considerations. Data Intelligence 2 (1–2): 10–29. https://doi.org/10.1162/dint_r_00024 .

Jahromi, A.N., S. Hashemi, A. Dehghantanha, R.M. Parizi, and K.K.R. Choo. 2020. An enhanced stacked LSTM method with no random initialization for malware threat hunting in safety and time-critical systems. IEEE Transactions on Emerging Topics in Computational Intelligence 4 (5): 630–640. https://doi.org/10.1109/TETCI.2019.2910243 .

Jang, S., S. Li, and Y. Sung. 2020. FastText-based local feature visualization algorithm for merged image-based malware classification framework for cyber security and cyber defense. Mathematics 8 (3): 13. https://doi.org/10.3390/math8030460 .

Javeed, D., T.H. Gao, and M.T. Khan. 2021. SDN-enabled hybrid DL-driven framework for the detection of emerging cyber threats in IoT. Electronics 10 (8): 16. https://doi.org/10.3390/electronics10080918 .

Johnson, P., D. Gorton, R. Lagerstrom, and M. Ekstedt. 2016. Time between vulnerability disclosures: A measure of software product vulnerability. Computers & Security 62: 278–295. https://doi.org/10.1016/j.cose.2016.08.004 .

Johnson, P., R. Lagerström, M. Ekstedt, and U. Franke. 2018. Can the common vulnerability scoring system be trusted? A Bayesian analysis. IEEE Transactions on Dependable and Secure Computing 15 (6): 1002–1015. https://doi.org/10.1109/TDSC.2016.2644614 .

Junger, M., V. Wang, and M. Schlömer. 2020. Fraud against businesses both online and offline: Crime scripts, business characteristics, efforts, and benefits. Crime Science 9 (1): 13. https://doi.org/10.1186/s40163-020-00119-4 .

Kalutarage, H.K., H.N. Nguyen, and S.A. Shaikh. 2017. Towards a threat assessment framework for apps collusion. Telecommunication Systems 66 (3): 417–430. https://doi.org/10.1007/s11235-017-0296-1 .

Kamarudin, M.H., C. Maple, T. Watson, and N.S. Safa. 2017. A LogitBoost-based algorithm for detecting known and unknown web attacks. IEEE Access 5: 26190–26200. https://doi.org/10.1109/ACCESS.2017.2766844 .

Kasongo, S.M., and Y.X. Sun. 2020. A deep learning method with wrapper based feature extraction for wireless intrusion detection system. Computers & Security 92: 15. https://doi.org/10.1016/j.cose.2020.101752 .

Keserwani, P.K., M.C. Govil, E.S. Pilli, and P. Govil. 2021. A smart anomaly-based intrusion detection system for the Internet of Things (IoT) network using GWO–PSO–RF model. Journal of Reliable Intelligent Environments 7 (1): 3–21. https://doi.org/10.1007/s40860-020-00126-x .

Keshk, M., E. Sitnikova, N. Moustafa, J. Hu, and I. Khalil. 2021. An integrated framework for privacy-preserving based anomaly detection for cyber-physical systems. IEEE Transactions on Sustainable Computing 6 (1): 66–79. https://doi.org/10.1109/TSUSC.2019.2906657 .

Khan, I.A., D.C. Pi, A.K. Bhatia, N. Khan, W. Haider, and A. Wahab. 2020. Generating realistic IoT-based IDS dataset centred on fuzzy qualitative modelling for cyber-physical systems. Electronics Letters 56 (9): 441–443. https://doi.org/10.1049/el.2019.4158 .

Khraisat, A., I. Gondal, P. Vamplew, J. Kamruzzaman, and A. Alazab. 2020. Hybrid intrusion detection system based on the stacking ensemble of C5 decision tree classifier and one class support vector machine. Electronics 9 (1): 18. https://doi.org/10.3390/electronics9010173 .

Khraisat, A., I. Gondal, P. Vamplew, and J. Kamruzzaman. 2019. Survey of intrusion detection systems: Techniques, datasets and challenges. Cybersecurity 2 (1): 20. https://doi.org/10.1186/s42400-019-0038-7 .

Kilincer, I.F., F. Ertam, and A. Sengur. 2021. Machine learning methods for cyber security intrusion detection: Datasets and comparative study. Computer Networks 188: 16. https://doi.org/10.1016/j.comnet.2021.107840 .

Kim, D., and H.K. Kim. 2019. Automated dataset generation system for collaborative research of cyber threat analysis. Security and Communication Networks 2019: 10. https://doi.org/10.1155/2019/6268476 .

Kim, G., C. Lee, J. Jo, and H. Lim. 2020. Automatic extraction of named entities of cyber threats using a deep Bi-LSTM-CRF network. International Journal of Machine Learning and Cybernetics 11 (10): 2341–2355. https://doi.org/10.1007/s13042-020-01122-6 .

Kirubavathi, G., and R. Anitha. 2016. Botnet detection via mining of traffic flow characteristics. Computers & Electrical Engineering 50: 91–101. https://doi.org/10.1016/j.compeleceng.2016.01.012 .

Kiwia, D., A. Dehghantanha, K.K.R. Choo, and J. Slaughter. 2018. A cyber kill chain based taxonomy of banking Trojans for evolutionary computational intelligence. Journal of Computational Science 27: 394–409. https://doi.org/10.1016/j.jocs.2017.10.020 .

Koroniotis, N., N. Moustafa, and E. Sitnikova. 2020. A new network forensic framework based on deep learning for Internet of Things networks: A particle deep framework. Future Generation Computer Systems 110: 91–106. https://doi.org/10.1016/j.future.2020.03.042 .

Kruse, C.S., B. Frederick, T. Jacobson, and D. Kyle Monticone. 2017. Cybersecurity in healthcare: A systematic review of modern threats and trends. Technology and Health Care 25 (1): 1–10.

Kshetri, N. 2018. The economics of cyber-insurance. IT Professional 20 (6): 9–14. https://doi.org/10.1109/MITP.2018.2874210 .

Kumar, R., P. Kumar, R. Tripathi, G.P. Gupta, T.R. Gadekallu, and G. Srivastava. 2021. SP2F: A secured privacy-preserving framework for smart agricultural Unmanned Aerial Vehicles. Computer Networks . https://doi.org/10.1016/j.comnet.2021.107819 .

Kumar, R., and R. Tripathi. 2021. DBTP2SF: A deep blockchain-based trustworthy privacy-preserving secured framework in industrial internet of things systems. Transactions on Emerging Telecommunications Technologies 32 (4): 27. https://doi.org/10.1002/ett.4222 .

Laso, P.M., D. Brosset, and J. Puentes. 2017. Dataset of anomalies and malicious acts in a cyber-physical subsystem. Data in Brief 14: 186–191. https://doi.org/10.1016/j.dib.2017.07.038 .

Lee, J., J. Kim, I. Kim, and K. Han. 2019. Cyber threat detection based on artificial neural networks using event profiles. IEEE Access 7: 165607–165626. https://doi.org/10.1109/ACCESS.2019.2953095 .

Lee, S.J., P.D. Yoo, A.T. Asyhari, Y. Jhi, L. Chermak, C.Y. Yeun, and K. Taha. 2020. IMPACT: Impersonation attack detection via edge computing using deep Autoencoder and feature abstraction. IEEE Access 8: 65520–65529. https://doi.org/10.1109/ACCESS.2020.2985089 .

Leong, Y.-Y., and Y.-C. Chen. 2020. Cyber risk cost and management in IoT devices-linked health insurance. The Geneva Papers on Risk and Insurance—Issues and Practice 45 (4): 737–759. https://doi.org/10.1057/s41288-020-00169-4 .

Levi, M. 2017. Assessing the trends, scale and nature of economic cybercrimes: overview and Issues: In Cybercrimes, cybercriminals and their policing, in crime, law and social change. Crime, Law and Social Change 67 (1): 3–20. https://doi.org/10.1007/s10611-016-9645-3 .

Li, C., K. Mills, D. Niu, R. Zhu, H. Zhang, and H. Kinawi. 2019a. Android malware detection based on factorization machine. IEEE Access 7: 184008–184019. https://doi.org/10.1109/ACCESS.2019.2958927 .

Li, D.Q., and Q.M. Li. 2020. Adversarial deep ensemble: evasion attacks and defenses for malware detection. IEEE Transactions on Information Forensics and Security 15: 3886–3900. https://doi.org/10.1109/tifs.2020.3003571 .

Li, D.Q., Q.M. Li, Y.F. Ye, and S.H. Xu. 2021. A framework for enhancing deep neural networks against adversarial malware. IEEE Transactions on Network Science and Engineering 8 (1): 736–750. https://doi.org/10.1109/tnse.2021.3051354 .

Li, R.H., C. Zhang, C. Feng, X. Zhang, and C.J. Tang. 2019b. Locating vulnerability in binaries using deep neural networks. IEEE Access 7: 134660–134676. https://doi.org/10.1109/access.2019.2942043 .

Li, X., M. Xu, P. Vijayakumar, N. Kumar, and X. Liu. 2020. Detection of low-frequency and multi-stage attacks in industrial Internet of Things. IEEE Transactions on Vehicular Technology 69 (8): 8820–8831. https://doi.org/10.1109/TVT.2020.2995133 .

Liu, H.Y., and B. Lang. 2019. Machine learning and deep learning methods for intrusion detection systems: A survey. Applied Sciences—Basel 9 (20): 28. https://doi.org/10.3390/app9204396 .

Lopez-Martin, M., B. Carro, and A. Sanchez-Esguevillas. 2020. Application of deep reinforcement learning to intrusion detection for supervised problems. Expert Systems with Applications . https://doi.org/10.1016/j.eswa.2019.112963 .

Loukas, G., D. Gan, and Tuan Vuong. 2013. A review of cyber threats and defence approaches in emergency management. Future Internet 5: 205–236.

Luo, C.C., S. Su, Y.B. Sun, Q.J. Tan, M. Han, and Z.H. Tian. 2020. A convolution-based system for malicious URLs detection. CMC—Computers Materials Continua 62 (1): 399–411.

Mahbooba, B., M. Timilsina, R. Sahal, and M. Serrano. 2021. Explainable artificial intelligence (XAI) to enhance trust management in intrusion detection systems using decision tree model. Complexity 2021: 11. https://doi.org/10.1155/2021/6634811 .

Mahdavifar, S., and A.A. Ghorbani. 2020. DeNNeS: Deep embedded neural network expert system for detecting cyber attacks. Neural Computing & Applications 32 (18): 14753–14780. https://doi.org/10.1007/s00521-020-04830-w .

Mahfouz, A., A. Abuhussein, D. Venugopal, and S. Shiva. 2020. Ensemble classifiers for network intrusion detection using a novel network attack dataset. Future Internet 12 (11): 1–19. https://doi.org/10.3390/fi12110180 .

Maleks Smith, Z., E. Lostri, and J.A. Lewis. 2020. The hidden costs of cybercrime. https://www.mcafee.com/enterprise/en-us/assets/reports/rp-hidden-costs-of-cybercrime.pdf . Accessed 16 May 2021.

Malik, J., A. Akhunzada, I. Bibi, M. Imran, A. Musaddiq, and S.W. Kim. 2020. Hybrid deep learning: An efficient reconnaissance and surveillance detection mechanism in SDN. IEEE Access 8: 134695–134706. https://doi.org/10.1109/ACCESS.2020.3009849 .

Manimurugan, S. 2020. IoT-Fog-Cloud model for anomaly detection using improved Naive Bayes and principal component analysis. Journal of Ambient Intelligence and Humanized Computing . https://doi.org/10.1007/s12652-020-02723-3 .

Martin, A., R. Lara-Cabrera, and D. Camacho. 2019. Android malware detection through hybrid features fusion and ensemble classifiers: The AndroPyTool framework and the OmniDroid dataset. Information Fusion 52: 128–142. https://doi.org/10.1016/j.inffus.2018.12.006 .

Mauro, M.D., G. Galatro, and A. Liotta. 2020. Experimental review of neural-based approaches for network intrusion management. IEEE Transactions on Network and Service Management 17 (4): 2480–2495. https://doi.org/10.1109/TNSM.2020.3024225 .

McLeod, A., and D. Dolezel. 2018. Cyber-analytics: Modeling factors associated with healthcare data breaches. Decision Support Systems 108: 57–68. https://doi.org/10.1016/j.dss.2018.02.007 .

Meira, J., R. Andrade, I. Praca, J. Carneiro, V. Bolon-Canedo, A. Alonso-Betanzos, and G. Marreiros. 2020. Performance evaluation of unsupervised techniques in cyber-attack anomaly detection. Journal of Ambient Intelligence and Humanized Computing 11 (11): 4477–4489. https://doi.org/10.1007/s12652-019-01417-9 .

Miao, Y., J. Ma, X. Liu, J. Weng, H. Li, and H. Li. 2019. Lightweight fine-grained search over encrypted data in Fog computing. IEEE Transactions on Services Computing 12 (5): 772–785. https://doi.org/10.1109/TSC.2018.2823309 .

Miller, C., and C. Valasek. 2015. Remote exploitation of an unaltered passenger vehicle. Black Hat USA 2015 (S 91).

Mireles, J.D., E. Ficke, J.H. Cho, P. Hurley, and S.H. Xu. 2019. Metrics towards measuring cyber agility. IEEE Transactions on Information Forensics and Security 14 (12): 3217–3232. https://doi.org/10.1109/tifs.2019.2912551 .

Mishra, N., and S. Pandya. 2021. Internet of Things applications, security challenges, attacks, intrusion detection, and future visions: A systematic review. IEEE Access . https://doi.org/10.1109/ACCESS.2021.3073408 .

Monshizadeh, M., V. Khatri, B.G. Atli, R. Kantola, and Z. Yan. 2019. Performance evaluation of a combined anomaly detection platform. IEEE Access 7: 100964–100978. https://doi.org/10.1109/ACCESS.2019.2930832 .

Moreno, V.C., G. Reniers, E. Salzano, and V. Cozzani. 2018. Analysis of physical and cyber security-related events in the chemical and process industry. Process Safety and Environmental Protection 116: 621–631. https://doi.org/10.1016/j.psep.2018.03.026 .

Moro, E.D. 2020. Towards an economic cyber loss index for parametric cover based on IT security indicator: A preliminary analysis. Risks . https://doi.org/10.3390/risks8020045 .

Moustafa, N., E. Adi, B. Turnbull, and J. Hu. 2018. A new threat intelligence scheme for safeguarding industry 4.0 systems. IEEE Access 6: 32910–32924. https://doi.org/10.1109/ACCESS.2018.2844794 .

Moustakidis, S., and P. Karlsson. 2020. A novel feature extraction methodology using Siamese convolutional neural networks for intrusion detection. Cybersecurity . https://doi.org/10.1186/s42400-020-00056-4 .

Mukhopadhyay, A., S. Chatterjee, K.K. Bagchi, P.J. Kirs, and G.K. Shukla. 2019. Cyber Risk Assessment and Mitigation (CRAM) framework using Logit and Probit models for cyber insurance. Information Systems Frontiers 21 (5): 997–1018. https://doi.org/10.1007/s10796-017-9808-5 .

Murphey, H. 2021a. Biden signs executive order to strengthen US cyber security. https://www.ft.com/content/4d808359-b504-4014-85f6-68e7a2851bf1?accessToken=zwAAAXl0_ifgkc9NgINZtQRAFNOF9mjnooUb8Q.MEYCIQDw46SFWsMn1iyuz3kvgAmn6mxc0rIVfw10Lg1ovJSfJwIhAK2X2URzfSqHwIS7ddRCvSt2nGC2DcdoiDTG49-4TeEt&sharetype=gift?token=fbcd6323-1ecf-4fc3-b136-b5b0dd6a8756 . Accessed 7 May 2021.

Murphey, H. 2021b. Millions of connected devices have security flaws, study shows. https://www.ft.com/content/0bf92003-926d-4dee-87d7-b01f7c3e9621?accessToken=zwAAAXnA7f2Ikc8L-SADkm1N7tOH17AffD6WIQ.MEQCIDjBuROvhmYV0Mx3iB0cEV7m5oND1uaCICxJu0mzxM0PAiBam98q9zfHiTB6hKGr1gGl0Azt85yazdpX9K5sI8se3Q&sharetype=gift?token=2538218d-77d9-4dd3-9649-3cb556a34e51 . Accessed 6 May 2021.

Murugesan, V., M. Shalinie, and M.H. Yang. 2018. Design and analysis of hybrid single packet IP traceback scheme. IET Networks 7 (3): 141–151. https://doi.org/10.1049/iet-net.2017.0115 .

Mwitondi, K.S., and S.A. Zargari. 2018. An iterative multiple sampling method for intrusion detection. Information Security Journal 27 (4): 230–239. https://doi.org/10.1080/19393555.2018.1539790 .

Neto, N.N., S. Madnick, A.M.G. De Paula, and N.M. Borges. 2021. Developing a global data breach database and the challenges encountered. ACM Journal of Data and Information Quality 13 (1): 33. https://doi.org/10.1145/3439873 .

Nurse, J.R.C., L. Axon, A. Erola, I. Agrafiotis, M. Goldsmith, and S. Creese. 2020. The data that drives cyber insurance: A study into the underwriting and claims processes. In 2020 International conference on cyber situational awareness, data analytics and assessment (CyberSA), 15–19 June 2020.

Oliveira, N., I. Praca, E. Maia, and O. Sousa. 2021. Intelligent cyber attack detection and classification for network-based intrusion detection systems. Applied Sciences—Basel 11 (4): 21. https://doi.org/10.3390/app11041674 .

Page, M.J. et al. 2021. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. Systematic Reviews 10 (1): 89. https://doi.org/10.1186/s13643-021-01626-4 .

Pajouh, H.H., R. Javidan, R. Khayami, A. Dehghantanha, and K.R. Choo. 2019. A two-layer dimension reduction and two-tier classification model for anomaly-based intrusion detection in IoT backbone networks. IEEE Transactions on Emerging Topics in Computing 7 (2): 314–323. https://doi.org/10.1109/TETC.2016.2633228 .

Parra, G.D., P. Rad, K.K.R. Choo, and N. Beebe. 2020. Detecting Internet of Things attacks using distributed deep learning. Journal of Network and Computer Applications 163: 13. https://doi.org/10.1016/j.jnca.2020.102662 .

Paté-Cornell, M.E., M. Kuypers, M. Smith, and P. Keller. 2018. Cyber risk management for critical infrastructure: A risk analysis model and three case studies. Risk Analysis 38 (2): 226–241. https://doi.org/10.1111/risa.12844 .

Pooser, D.M., M.J. Browne, and O. Arkhangelska. 2018. Growth in the perception of cyber risk: evidence from U.S. P&C Insurers. The Geneva Papers on Risk and Insurance—Issues and Practice 43 (2): 208–223. https://doi.org/10.1057/s41288-017-0077-9 .

Pu, G., L. Wang, J. Shen, and F. Dong. 2021. A hybrid unsupervised clustering-based anomaly detection method. Tsinghua Science and Technology 26 (2): 146–153. https://doi.org/10.26599/TST.2019.9010051 .

Qiu, J., W. Luo, L. Pan, Y. Tai, J. Zhang, and Y. Xiang. 2019. Predicting the impact of android malicious samples via machine learning. IEEE Access 7: 66304–66316. https://doi.org/10.1109/ACCESS.2019.2914311 .

Qu, X., L. Yang, K. Guo, M. Sun, L. Ma, T. Feng, S. Ren, K. Li, and X. Ma. 2020. Direct batch growth hierarchical self-organizing mapping based on statistics for efficient network intrusion detection. IEEE Access 8: 42251–42260. https://doi.org/10.1109/ACCESS.2020.2976810 .

Rahman, Md.S., S. Halder, Md. Ashraf Uddin, and U.K. Acharjee. 2021. An efficient hybrid system for anomaly detection in social networks. Cybersecurity 4 (1): 10. https://doi.org/10.1186/s42400-021-00074-w .

Ramaiah, M., V. Chandrasekaran, V. Ravi, and N. Kumar. 2021. An intrusion detection system using optimized deep neural network architecture. Transactions on Emerging Telecommunications Technologies 32 (4): 17. https://doi.org/10.1002/ett.4221 .

Raman, M.R.G., K. Kannan, S.K. Pal, and V.S.S. Sriram. 2016. Rough set-hypergraph-based feature selection approach for intrusion detection systems. Defence Science Journal 66 (6): 612–617. https://doi.org/10.14429/dsj.66.10802 .

Rathore, S., J.H. Park. 2018. Semi-supervised learning based distributed attack detection framework for IoT. Applied Soft Computing 72: 79–89. https://doi.org/10.1016/j.asoc.2018.05.049 .

Romanosky, S., L. Ablon, A. Kuehn, and T. Jones. 2019. Content analysis of cyber insurance policies: How do carriers price cyber risk? Journal of Cybersecurity (oxford) 5 (1): tyz002.

Sarabi, A., P. Naghizadeh, Y. Liu, and M. Liu. 2016. Risky business: Fine-grained data breach prediction using business profiles. Journal of Cybersecurity 2 (1): 15–28. https://doi.org/10.1093/cybsec/tyw004 .

Sardi, Alberto, Alessandro Rizzi, Enrico Sorano, and Anna Guerrieri. 2021. Cyber risk in health facilities: A systematic literature review. Sustainability 12 (17): 7002.

Sarker, Iqbal H., A.S.M. Kayes, Shahriar Badsha, Hamed Alqahtani, Paul Watters, and Alex Ng. 2020. Cybersecurity data science: An overview from machine learning perspective. Journal of Big Data 7 (1): 41. https://doi.org/10.1186/s40537-020-00318-5 .

Scopus. 2021. Factsheet. https://www.elsevier.com/__data/assets/pdf_file/0017/114533/Scopus_GlobalResearch_Factsheet2019_FINAL_WEB.pdf . Accessed 11 May 2021.

Sentuna, A., A. Alsadoon, P.W.C. Prasad, M. Saadeh, and O.H. Alsadoon. 2021. A novel Enhanced Naïve Bayes Posterior Probability (ENBPP) using machine learning: Cyber threat analysis. Neural Processing Letters 53 (1): 177–209. https://doi.org/10.1007/s11063-020-10381-x .

Shaukat, K., S.H. Luo, V. Varadharajan, I.A. Hameed, S. Chen, D.X. Liu, and J.M. Li. 2020. Performance comparison and current challenges of using machine learning techniques in cybersecurity. Energies 13 (10): 27. https://doi.org/10.3390/en13102509 .

Sheehan, B., F. Murphy, M. Mullins, and C. Ryan. 2019. Connected and autonomous vehicles: A cyber-risk classification framework. Transportation Research Part a: Policy and Practice 124: 523–536. https://doi.org/10.1016/j.tra.2018.06.033 .

Sheehan, B., F. Murphy, A.N. Kia, and R. Kiely. 2021. A quantitative bow-tie cyber risk classification and assessment framework. Journal of Risk Research 24 (12): 1619–1638.

Shlomo, A., M. Kalech, and R. Moskovitch. 2021. Temporal pattern-based malicious activity detection in SCADA systems. Computers & Security 102: 17. https://doi.org/10.1016/j.cose.2020.102153 .

Singh, K.J., and T. De. 2020. Efficient classification of DDoS attacks using an ensemble feature selection algorithm. Journal of Intelligent Systems 29 (1): 71–83. https://doi.org/10.1515/jisys-2017-0472 .

Skrjanc, I., S. Ozawa, T. Ban, and D. Dovzan. 2018. Large-scale cyber attacks monitoring using Evolving Cauchy Possibilistic Clustering. Applied Soft Computing 62: 592–601. https://doi.org/10.1016/j.asoc.2017.11.008 .

Smart, W. 2018. Lessons learned review of the WannaCry Ransomware Cyber Attack. https://www.england.nhs.uk/wp-content/uploads/2018/02/lessons-learned-review-wannacry-ransomware-cyber-attack-cio-review.pdf . Accessed 7 May 2021.

Sornette, D., T. Maillart, and W. Kröger. 2013. Exploring the limits of safety analysis in complex technological systems. International Journal of Disaster Risk Reduction 6: 59–66. https://doi.org/10.1016/j.ijdrr.2013.04.002 .

Sovacool, B.K. 2008. The costs of failure: A preliminary assessment of major energy accidents, 1907–2007. Energy Policy 36 (5): 1802–1820. https://doi.org/10.1016/j.enpol.2008.01.040 .

SpringerLink. 2021. Journal Search. https://rd.springer.com/search?facet-content-type=%22Journal%22 . Accessed 11 May 2021.

Stojanovic, B., K. Hofer-Schmitz, and U. Kleb. 2020. APT datasets and attack modeling for automated detection methods: A review. Computers & Security 92: 19. https://doi.org/10.1016/j.cose.2020.101734 .

Subroto, A., and A. Apriyana. 2019. Cyber risk prediction through social media big data analytics and statistical machine learning. Journal of Big Data . https://doi.org/10.1186/s40537-019-0216-1 .

Tan, Z., A. Jamdagni, X. He, P. Nanda, R.P. Liu, and J. Hu. 2015. Detection of denial-of-service attacks based on computer vision techniques. IEEE Transactions on Computers 64 (9): 2519–2533. https://doi.org/10.1109/TC.2014.2375218 .

Tidy, J. 2021. Irish cyber-attack: Hackers bail out Irish health service for free. https://www.bbc.com/news/world-europe-57197688 . Accessed 6 May 2021.

Tuncer, T., F. Ertam, and S. Dogan. 2020. Automated malware recognition method based on local neighborhood binary pattern. Multimedia Tools and Applications 79 (37–38): 27815–27832. https://doi.org/10.1007/s11042-020-09376-6 .

Uhm, Y., and W. Pak. 2021. Service-aware two-level partitioning for machine learning-based network intrusion detection with high performance and high scalability. IEEE Access 9: 6608–6622. https://doi.org/10.1109/ACCESS.2020.3048900 .

Ulven, J.B., and G. Wangen. 2021. A systematic review of cybersecurity risks in higher education. Future Internet 13 (2): 1–40. https://doi.org/10.3390/fi13020039 .

Vaccari, I., G. Chiola, M. Aiello, M. Mongelli, and E. Cambiaso. 2020. MQTTset, a new dataset for machine learning techniques on MQTT. Sensors 20 (22): 17. https://doi.org/10.3390/s20226578 .

Valeriano, B., and R.C. Maness. 2014. The dynamics of cyber conflict between rival antagonists, 2001–11. Journal of Peace Research 51 (3): 347–360. https://doi.org/10.1177/0022343313518940 .

Varghese, J.E., and B. Muniyal. 2021. An Efficient IDS framework for DDoS attacks in SDN environment. IEEE Access 9: 69680–69699. https://doi.org/10.1109/ACCESS.2021.3078065 .

Varsha, M. V., P. Vinod, K.A. Dhanya. 2017 Identification of malicious android app using manifest and opcode features. Journal of Computer Virology and Hacking Techniques 13 (2): 125–138. https://doi.org/10.1007/s11416-016-0277-z

Velliangiri, S., and H.M. Pandey. 2020. Fuzzy-Taylor-elephant herd optimization inspired Deep Belief Network for DDoS attack detection and comparison with state-of-the-arts algorithms. Future Generation Computer Systems—the International Journal of Escience 110: 80–90. https://doi.org/10.1016/j.future.2020.03.049 .

Verma, A., and V. Ranga. 2020. Machine learning based intrusion detection systems for IoT applications. Wireless Personal Communications 111 (4): 2287–2310. https://doi.org/10.1007/s11277-019-06986-8 .

Vidros, S., C. Kolias, G. Kambourakis, and L. Akoglu. 2017. Automatic detection of online recruitment frauds: Characteristics, methods, and a public dataset. Future Internet 9 (1): 19. https://doi.org/10.3390/fi9010006 .

Vinayakumar, R., M. Alazab, K.P. Soman, P. Poornachandran, A. Al-Nemrat, and S. Venkatraman. 2019. Deep learning approach for intelligent intrusion detection system. IEEE Access 7: 41525–41550. https://doi.org/10.1109/access.2019.2895334 .

Walker-Roberts, S., M. Hammoudeh, O. Aldabbas, M. Aydin, and A. Dehghantanha. 2020. Threats on the horizon: Understanding security threats in the era of cyber-physical systems. Journal of Supercomputing 76 (4): 2643–2664. https://doi.org/10.1007/s11227-019-03028-9 .

Web of Science. 2021. Web of Science: Science Citation Index Expanded. https://clarivate.com/webofsciencegroup/solutions/webofscience-scie/ . Accessed 11 May 2021.

World Economic Forum. 2020. WEF Global Risk Report. http://www3.weforum.org/docs/WEF_Global_Risk_Report_2020.pdf . Accessed 13 May 2020.

Xin, Y., L. Kong, Z. Liu, Y. Chen, Y. Li, H. Zhu, M. Gao, H. Hou, and C. Wang. 2018. Machine learning and deep learning methods for cybersecurity. IEEE Access 6: 35365–35381. https://doi.org/10.1109/ACCESS.2018.2836950 .

Xu, C., J. Zhang, K. Chang, and C. Long. 2013. Uncovering collusive spammers in Chinese review websites. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management.

Yang, J., T. Li, G. Liang, W. He, and Y. Zhao. 2019. A Simple recurrent unit model based intrusion detection system with DCGAN. IEEE Access 7: 83286–83296. https://doi.org/10.1109/ACCESS.2019.2922692 .

Yuan, B.G., J.F. Wang, D. Liu, W. Guo, P. Wu, and X.H. Bao. 2020. Byte-level malware classification based on Markov images and deep learning. Computers & Security 92: 12. https://doi.org/10.1016/j.cose.2020.101740 .

Zhang, S., X.M. Ou, and D. Caragea. 2015. Predicting cyber risks through national vulnerability database. Information Security Journal 24 (4–6): 194–206. https://doi.org/10.1080/19393555.2015.1111961 .

Zhang, Y., P. Li, and X. Wang. 2019. Intrusion detection for IoT based on improved genetic algorithm and deep belief network. IEEE Access 7: 31711–31722.

Zheng, Muwei, Hannah Robbins, Zimo Chai, Prakash Thapa, and Tyler Moore. 2018. Cybersecurity research datasets: taxonomy and empirical analysis. In 11th {USENIX} workshop on cyber security experimentation and test ({CSET} 18).

Zhou, X., W. Liang, S. Shimizu, J. Ma, and Q. Jin. 2021. Siamese neural network based few-shot learning for anomaly detection in industrial cyber-physical systems. IEEE Transactions on Industrial Informatics 17 (8): 5790–5798. https://doi.org/10.1109/TII.2020.3047675 .

Zhou, Y.Y., G. Cheng, S.Q. Jiang, and M. Dai. 2020. Building an efficient intrusion detection system based on feature selection and ensemble classifier. Computer Networks 174: 17. https://doi.org/10.1016/j.comnet.2020.107247 .

Download references

Open Access funding provided by the IReL Consortium.

Author information

Authors and affiliations.

University of Limerick, Limerick, Ireland

Frank Cremer, Barry Sheehan, Arash N. Kia, Martin Mullins & Finbarr Murphy

TH Köln University of Applied Sciences, Cologne, Germany

Michael Fortmann & Stefan Materne

You can also search for this author in PubMed   Google Scholar

Corresponding author

Correspondence to Barry Sheehan .

Ethics declarations

Conflict of interest.

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (PDF 334 kb)

Supplementary file1 (docx 418 kb), rights and permissions.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Cremer, F., Sheehan, B., Fortmann, M. et al. Cyber risk and cybersecurity: a systematic review of data availability. Geneva Pap Risk Insur Issues Pract 47 , 698–736 (2022). https://doi.org/10.1057/s41288-022-00266-6

Download citation

Received : 15 June 2021

Accepted : 20 January 2022

Published : 17 February 2022

Issue Date : July 2022

DOI : https://doi.org/10.1057/s41288-022-00266-6

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Cyber insurance
  • Systematic review
  • Cybersecurity
  • Find a journal
  • Publish with us
  • Track your research

TechRepublic

Male system administrator of big data center typing on laptop computer while working in server room. Programming digital operation. Man engineer working online in database center. Telecommunication.

8 Best Data Science Tools and Software

Apache Spark and Hadoop, Microsoft Power BI, Jupyter Notebook and Alteryx are among the top data science tools for finding business insights. Compare their features, pros and cons.

AI act trilogue press conference.

EU’s AI Act: Europe’s New Rules for Artificial Intelligence

Europe's AI legislation, adopted March 13, attempts to strike a tricky balance between promoting innovation and protecting citizens' rights.

Concept image of a woman analyzing data.

10 Best Predictive Analytics Tools and Software for 2024

Tableau, TIBCO Data Science, IBM and Sisense are among the best software for predictive analytics. Explore their features, pricing, pros and cons to find the best option for your organization.

Tableau logo.

Tableau Review: Features, Pricing, Pros and Cons

Tableau has three pricing tiers that cater to all kinds of data teams, with capabilities like accelerators and real-time analytics. And if Tableau doesn’t meet your needs, it has a few alternatives worth noting.

Futuristic concept art for big data solution for enterprises.

Top 6 Enterprise Data Storage Solutions for 2024

Amazon, IDrive, IBM, Google, NetApp and Wasabi offer some of the top enterprise data storage solutions. Explore their features and benefits, and find the right solution for your organization's needs.

Latest Articles

Businessman holding a virtual shield with check mark,

How Can Businesses Defend Themselves Against Common Cyberthreats?

TechRepublic consolidated expert advice on how businesses can defend themselves against the most common cyberthreats, including zero-days, ransomware and deepfakes.

CRM displayed on a monitor and surrounded by flat icons of CRM features.

Top 10 CRM Features and Functionalities

Discover the top CRM features for business success. Explore a curated list of key capabilities to consider when choosing the right CRM solution, including marketing tools, activity tracking and more.

Cubes, dice or blocks with deep fake letters.

Combatting Deepfakes in Australia: Content Credentials is the Start

The production of deepfakes is accelerating at more than 1,500% in Australia, forcing organisations to create and adopt standards like Content Credentials.

Pipedrive logo.

The Top 5 Pipedrive Alternatives for 2024

Discover the top alternatives to Pipedrive. Explore a curated list of CRM platforms with similar features, pricing and pros and cons to find the best fit for your business.

Technology background with national flag of Australia.

The Australian Government’s Manufacturing Objectives Rely on IT Capabilities

The intent of the Future Made in Australia Act is to build manufacturing capabilities across all sectors, which will likely lead to more demand for IT skills and services.

Businessman add new skill or gear into human head to upgrade working skill.

Udemy Report: Which IT Skills Are Most in Demand in Q1 2024?

Informatica PowerCenter, Microsoft Playwright and Oracle Database SQL top Udemy’s list of most popular tech courses.

Digital map of Australia,

Gartner: 4 Bleeding-Edge Technologies in Australia

Gartner recently identified emerging tech that will impact enterprise leaders in APAC. Here’s what IT leaders in Australia need to know about these innovative technologies.

research paper on security in big data

Llama 3 Cheat Sheet: A Complete Guide for 2024

Learn how to access Meta’s new AI model Llama 3, which sets itself apart by being open to use under a license agreement.

Zoho vs Salesforce.

Zoho vs Salesforce (2024): Which CRM Is Better?

Look at Zoho CRM and Salesforce side-by-side to compare the cost per functionality and top pros and of each provider to determine which is better for your business needs.

Businessman hand holding glowing digital brain.

9 Innovative Use Cases of AI in Australian Businesses in 2024

Australian businesses are beginning to effectively grapple with AI and build solutions specific to their needs. Here are notable use cases of businesses using AI.

An illustration of a monthly salary of a happy employee on year 2024.

How Are APAC Tech Salaries Faring in 2024?

The year 2024 is bringing a return to stable tech salary growth in APAC, with AI and data jobs leading the way. This follows downward salary pressure in 2023, after steep increases in previous years.

Splash graphic featuring the logo of Anthropic.

Anthropic Releases Claude Team Enterprise AI Plan and iOS App

The enterprise plan seeks to fill a need for generative AI tools for small and medium businesses. Plus, a Claude app is now on iOS.

Audience at conference hall.

Top Tech Conferences & Events to Add to Your Calendar in 2024

A great way to stay current with the latest technology trends and innovations is by attending conferences. Read and bookmark our 2024 tech events guide.

research paper on security in big data

TechRepublic Premium Editorial Calendar: Policies, Checklists, Hiring Kits and Glossaries for Download

TechRepublic Premium content helps you solve your toughest IT issues and jump-start your career or next project.

Close up of IBM logo at their headquarters located in SOMA district, downtown San Francisco.

IBM Acquires HashiCorp for $6.4 Billion, Expanding Hybrid Cloud Offerings

The deal is intended to strengthen IBM’s hybrid and multicloud offerings and generative AI deployment.

Create a TechRepublic Account

Get the web's best business technology news, tutorials, reviews, trends, and analysis—in your inbox. Let's start with the basics.

* - indicates required fields

Sign in to TechRepublic

Lost your password? Request a new password

Reset Password

Please enter your email adress. You will receive an email message with instructions on how to reset your password.

Check your email for a password reset link. If you didn't receive an email don't forgot to check your spam folder, otherwise contact support .

Welcome. Tell us a little bit about you.

This will help us provide you with customized content.

Want to receive more TechRepublic news?

You're all set.

Thanks for signing up! Keep an eye out for a confirmation email from our team. To ensure any newsletters you subscribed to hit your inbox, make sure to add [email protected] to your contacts list.

Research on the Big Data Security Application Based on Artificial Intelligence Technology in Operators

Ieee account.

  • Change Username/Password
  • Update Address

Purchase Details

  • Payment Options
  • Order History
  • View Purchased Documents

Profile Information

  • Communications Preferences
  • Profession and Education
  • Technical Interests
  • US & Canada: +1 800 678 4333
  • Worldwide: +1 732 981 0060
  • Contact & Support
  • About IEEE Xplore
  • Accessibility
  • Terms of Use
  • Nondiscrimination Policy
  • Privacy & Opting Out of Cookies

A not-for-profit organization, IEEE is the world's largest technical professional organization dedicated to advancing technology for the benefit of humanity. © Copyright 2024 IEEE - All rights reserved. Use of this web site signifies your agreement to the terms and conditions.

  • Artificial Intelligence
  • Generative AI
  • Business Operations
  • Cloud Computing
  • Data Center
  • Data Management
  • Emerging Technology
  • Enterprise Applications
  • IT Leadership
  • Digital Transformation
  • IT Strategy
  • IT Management
  • Diversity and Inclusion
  • IT Operations
  • Project Management
  • Software Development
  • Vendors and Providers
  • Enterprise Buyer’s Guides
  • United States
  • Middle East
  • España (Spain)
  • Italia (Italy)
  • Netherlands
  • United Kingdom
  • New Zealand
  • Data Analytics & AI
  • Newsletters
  • Foundry Careers
  • Terms of Service
  • Privacy Policy
  • Cookie Policy
  • Copyright Notice
  • Member Preferences
  • About AdChoices
  • Your California Privacy Rights

Our Network

  • Computerworld
  • Network World

Yash Mehta

Innovative data integration in 2024: Pioneering the future of data integration

This article discusses the latest advancements in the data integration industry and how organisations can successfully integrate these technologies into their existing data strategy..

Innovative Data Integration in 2024: Pioneering the Future of Data Integration by Yash Mehta

In the age of big data, where information is generated at an unprecedented rate, the ability to integrate and manage diverse data sources has become a critical business imperative. Traditional data integration methods are often cumbersome, time-consuming, and unable to keep up with the rapidly evolving data landscape.

Of late, innovative data integration tools are revolutionising how organisations approach data management, unlocking new opportunities for growth, efficiency, and strategic decision-making by leveraging technical advancements in Artificial Intelligence, Machine Learning, and Natural Language Processing. This helps reduce the points of failure due to human intervention. But what does the future hold for the realm of data integration ? This article focuses on how these advancements are paving the way for data integration for the years to come in this ever-so-dynamic technological era.

AI-powered data integration

One of the most promising advancements in data integration is the integration of artificial intelligence (AI) and machine learning (ML) technologies. AI-powered data integration tools leverage advanced algorithms and predictive analytics to automate and streamline the data integration process. According to a recent report by InformationWeek , enterprises with a strong AI strategy are 3 times more likely to report above-average data integration success. Additionally, a study by McKinsey found that organisations leveraging AI in data integration can achieve an average improvement of 20% in data quality. Organisations widely use AI/ML for the following reasons.

Anomaly detection

Anomaly detection algorithms can identify unusual patterns in data that might indicate errors, fraud, or emerging trends. For instance, these algorithms can detect fraudulent credit card transactions or identify equipment malfunctions in sensor data.

Neural networks

Neural networks, inspired by the human brain, can learn complex patterns from data and use that knowledge to automate data mapping, transformation, and quality checks. This significantly reduces manual effort and improves the accuracy of data integration.

Natural Language Processing (NLP)

Natural Language Processing allows AI tools to understand and process human language. This is crucial for extracting insights from text-based data sources like social media feeds, customer reviews, and emails. NLP can also be used to automatically generate data pipelines based on the content of data sources.

An AI-powered data integration platform provider called K2view leverages machine learning for data discovery, schema matching, and data quality management. Their AI engine can automatically learn data structures and relationships, simplifying the integration process and minimising the need for manual configuration.

AI-powered data integration solutions are particularly effective in handling complex, unstructured data sources, such as social media feeds, sensor data, and customer interactions. By applying natural language processing (NLP) and computer vision techniques, these tools can extract meaningful insights and integrate data from diverse formats, languages, and locations. This enables organisations to unlock the full potential of their data assets, making informed decisions and driving innovative business strategies.

Serverless data integration

The rise of serverless computing has also transformed the data integration landscape. Serverless data integration platforms eliminate the need for traditional server infrastructure, allowing organisations to focus on the core functionality of their data integration processes rather than managing the underlying hardware and software.

According to a recent forecast by Grand View Research, the global serverless computing market is expected to reach a staggering $21.4 billion by 2025. Serverless data integration solutions leverage cloud-based services, such as AWS Lambda, Google Cloud Functions, or Azure Functions, to execute data integration tasks on demand without needing dedicated servers or resource provisioning. This approach offers several benefits, including scalability, cost-efficiency, and reduced maintenance overhead, as the cloud provider handles the infrastructure management and scaling.

A study by Flexera shows that organisations using serverless data integration platforms can achieve up to a 40% reduction in operational costs compared to traditional approaches. However, it’s important to consider some potential drawbacks of serverless architecture.

Vendor lock-in

Reliance on a specific cloud provider’s serverless functions can make it difficult to switch to another provider in the future.

Cold start times

When a serverless function hasn’t been used recently, there can be a slight delay (cold start) as the cloud provider allocates resources. This can impact performance for infrequently used integrations.

A leading serverless data integration solution, Flatfile, offers a user-friendly interface and pre-built connectors for various data sources, allowing businesses to set up data pipelines quickly and easily without managing servers. Their serverless architecture ensures automatic scaling and eliminates the need for complex infrastructure management.

By leveraging serverless technology, data integration teams can rapidly deploy and execute data pipelines, respond to changing business requirements, and scale their integration capabilities as needed, without the burden of managing complex server environments.

Data Integration as a Service (DIaaS)

Another innovative approach to data integration is the emergence of Data Integration as a Service (DIaaS) platforms. These cloud-based solutions offer a comprehensive suite of data integration tools and capabilities, delivered as a subscription-based service.

DIaaS platforms provide a centralised hub for managing data integration workflows, from data ingestion and transformation to data quality management and advanced analytics. By leveraging the expertise and infrastructure of specialised service providers, organisations can benefit from the latest data integration technologies, scalable resources, and continuous updates, without the need to invest in expensive on-premises infrastructure or maintain in-house data integration teams.

Security considerations

While DIaaS offers numerous advantages, it’s crucial to consider security implications when entrusting data to a cloud-based provider. Ensure the DIaaS platform employs robust security measures like rest and transit encryption, access controls, and regular security audits.

Skyvia is a prominent DIaaS platform that prioritises security by offering a secure and compliant environment for data integration. Its features include role-based access control, data encryption, and automated data masking. Their user-friendly interface and pre-built connectors simplify data integration tasks, making it suitable for businesses of all sizes.

IoT data integration

The rise of the Internet of Things (IoT) has introduced a new layer of complexity in data integration. IoT devices generate a vast amount of real-time data from various sensors, devices, and systems, creating the need for seamless integration and analysis of this information. According to a report by Statista , the global IoT market size is projected to surpass $1.6 trillion by 2025, highlighting the continued growth and importance of IoT devices. However, a recent survey by IDC found that 70% of organisations still struggle with managing and integrating IoT data, emphasising the ongoing need for specialised data integration tools.

Innovative data integration tools are addressing the challenges of IoT data integration by providing specialised capabilities, such as edge computing, data normalisation, and real-time data streaming. These tools can collect, process, and integrate data from IoT devices, ensuring that critical insights are extracted and made available to decision-makers promptly.

One key capability of IoT data integration tools is edge computing, which performs data processing and analysis closer to the IoT devices rather than relying on a centralised cloud infrastructure. This reduces latency, improves responsiveness, and enables real-time decision-making, crucial for time-sensitive IoT applications like predictive maintenance or autonomous systems.

Additionally, these tools often incorporate advanced data normalisation techniques to handle the heterogeneity of IoT data, which can come from various sensor types, communication protocols, and data formats. By standardising this missing piece in your data integration strategy, organisations can get a unified view of an operational environment and save millions of dollars.

By integrating IoT data with other enterprise data sources, organisations can comprehensively understand their operations, customer behaviour, and market dynamics. These insights enable them to optimise processes, enhance product development, and deliver personalised customer experiences, ultimately driving innovation and competitive advantage.

Innovative data integration tools empower businesses to unlock their data’s full potential. AI, serverless architectures, and DIaaS are transforming how organisations approach data, enabling them to make data-driven decisions that fuel growth and innovation. Emerging trends like real-time data integration and low-code/no-code tools promise to democratise data access and analysis further, making it even easier for businesses to leverage the power of data. However, ethical considerations like data privacy and bias remain paramount for responsible data integration practices.

Related content

Raj polanki’s five steps by which cios can lead holistically, where’s the roai, the cyber pandemic: ai deepfakes and the future of security and identity verification, the ai cat and mouse game has begun, from our editors straight to your inbox.

Yash Mehta

Yash Mehta is an internationally recognized Internet of Things (IoT), machine to machine (M2M) communications and big data technology expert. He has written a number of widely acknowledged articles on data science, IoT, business innovation, tools, security technologies, business strategies, development, etc. His articles have been featured on the most authoritative publications and awarded as one of the most innovative and influential work in the connected technology industry by IBM and Cisco IoT department. His work has been featured on leading industry platforms that have a specialization in big data science and M2M. His work was published in the featured category of IEEE Journal (worldwide edition - March 2016) and he was highlighted as a business intelligence expert. The opinions expressed in this blog are those of Yash Mehta and do not necessarily represent those of IDG Communications, Inc., its parent, subsidiary or affiliated companies.

More from this author

Cisos are not just the keepers of our data – they must be its custodians, explainer: building a high-performing last-mile delivery software, website spoofing: risks, threats, and mitigation strategies for cios, most popular authors.

research paper on security in big data

Show me more

Boomi burnishes api management capabilities.

Image

TIAA modernizes the customer journey with AI

Image

10 IT skills where expertise pays the most

Image

CIO Leadership Live with Sri Adusumilli, Chief Information Officer, TruckPro LLC

Image

CIO Leadership Live NZ with Tim Partington, Director Digital & Technology, Skills Consulting Group

Image

CIO Leadership Live UK with Charlene Hunter, MBE, CEO & Founder, Coding Black Females

Image

A Fundamental Question: Open or Closed Source?

Image

Sponsored Links

  • Everybody's ready for AI except your data. Unlock the power of AI with Informatica

IMAGES

  1. (PDF) Big Data Security Issues Research Paper

    research paper on security in big data

  2. (PDF) A Review on Data Security in Cloud Computing

    research paper on security in big data

  3. (PDF) Cyber security and artificial intelligence

    research paper on security in big data

  4. 👍 Database security research paper. Database security research paper

    research paper on security in big data

  5. (PDF) Impact of Big Data on Security

    research paper on security in big data

  6. Research paper on cloud computing security pdf

    research paper on security in big data

VIDEO

  1. Security thread uv fiber banknote watermark certificate cotton paper supplier factory

  2. The Devil's in the Data: Role of Data Governance in Cyber Risk Mitigation

  3. Qu’est-ce que le “big security data” ?

  4. How to prove EVM Hacking

  5. 🚀 Data Intelligence Company: Big Data, AI, and Security

  6. Utharamchal University Dehradun, Admissions Open for 2024

COMMENTS

  1. Discussion Data security governance in the era of big data: status, challenges, and prospects

    However, increasing big data security problems are exposed and the number of big data security incidents occurring around the world is on the rise. Against this background, this paper provides a summary of the present situation of global data security governance, points out the challenges, and then proceeds to raise solutions for further ...

  2. A key review on security and privacy of big data: issues, challenges

    Big data collection means collecting large volumes of data to have insight into better business decisions and greater customer satisfaction. Securing big data is difficult not just because of the large amount of data it handles, but also because of the continuous streaming of data, multiple types of data, and cloud-based data storage. Additionally, traditional security and privacy methods are ...

  3. Data Security in Big Data: Challenges, Strategies, and Future Trends

    In the dynamic landscape of big data analytics, this paper explores the critical dimension of data. security, addressing challenges, strategies, and emerging trends. Recognizing the exponential ...

  4. Privacy Prevention of Big Data Applications: A Systematic Literature

    Information security is a Big Data issue. The research community is focusing on the development in the period of Big Data, computer science, and increasing business applications of quick and efficient algorithms for Big Data security intelligence, with the primary aim of ensuring a safe environment free of unlawful access (Cheng et al., 2017).

  5. PDF Security and Privacy for Big Data: A Systematic Literature Review

    A discussion of the findings and directions for future work is presented in Section IV. Lastly, II. METHODOLOGY In this paper, we perform a systematic literature review (SLR) to document what security and privacy research exists within the big data area, and identify possible areas where further research is needed.

  6. Big Data and Cybersecurity: A Review of Key Privacy and Security

    Even as the big data industry grows every day, security issues are also rising to the top of today's technology priorities. Recent technological developments have increased the need for big data, leading to the outsourcing of both the data and the business applications to third parties. This has also raised questions about the security and privacy of large data. In the context of big data ...

  7. Cybersecurity in Big Data Era: From Securing Big Data to Data-Driven

    We highlight how big data is protected and how big data can also be used as a tool for cybersecurity. We summarize recent works in the form of tables and have presented trends, open research challenges and problems. With this paper, readers can have a more thorough understanding of cybersecurity in the big data era, as well as research trends ...

  8. PDF A key review on security and privacy of big data: issues ...

    Securing big data is difficult not just because of the large amount of data it handles, but also because of the continuous streaming of data, multiple types of data, and cloud-based data storage. Additionally, traditional security and privacy methods are insufficient due to large data volume, speed, and data from different sources.

  9. Security and privacy for big data: A systematic literature review

    Big data is currently a hot research topic, with four million hits on Google scholar in October 2016. One reason for the popularity of big data research is the knowledge that can be extracted from analyzing these large data sets. However, data can contain sensitive information, and data must therefore be sufficiently protected as it is stored and processed. Furthermore, it might also be ...

  10. Security and Privacy Issues of Big Data

    Considering Big Data there is a set of risk areas that need to be considered. These include the information lifecycle (provenance, ownership and classification of data), the data creation and collection process, and the lack of security procedures. Ultimately, the Big Data security objectives are no different from any

  11. Big data in cybersecurity: a survey of applications and future trends

    The paper categorizes applications into areas of intrusion and anomaly detection, spamming and spoofing detection, malware and ransomware detection, code security, cloud security, along with another category surveying other directions of research in big data and cybersecurity. The paper concludes with pointing to possible future directions in ...

  12. Cybersecurity data science: an overview from ...

    In a computing context, cybersecurity is undergoing massive shifts in technology and its operations in recent days, and data science is driving the change. Extracting security incident patterns or insights from cybersecurity data and building corresponding data-driven model, is the key to make a security system automated and intelligent. To understand and analyze the actual phenomena with data ...

  13. Research Challenges at the Intersection of Big Data, Security and

    On the other hand, while linking and sharing data across organizations, privacy/security issues need to be considered. Below, we provide an overview of novel research challenges that are at the intersection of cybersecurity, privacy and big data. Go to: 2. Storing and Querying Big Data.

  14. (PDF) Big Data Security Issues and Challenges

    Data is growing because of use of internet, smart. phone and social network. Big data is a collection of data sets which is very larg e in size as well as complex. Genera lly. size of the data is ...

  15. Big data in cybersecurity: a survey of applications and future trends

    The paper aims to introduce a thorough survey on the use of big data analytics in building, improving, or defying cybersecurity systems. This paper surveys state-of-the-art research in different ...

  16. Security and Privacy Issues in IoT-Based Big Data Cloud Systems in a

    To sum up, there is a need for more research in the area of Security and Management of BD in a CC environment, and thus, we propose a new encryption method of the data in a sustainable Cloud environment that will deal with and solve challenges such as the Management and Security of data, which, in many cases, are IoT-based Big Data.

  17. Big Data Security and Privacy Protection

    In view of the wide application and popularization of large data, more and more data security and privacy issues have brought great challenges to the development of large data. Starting from the characteristics of big data, this paper analyses various risks of information security, and puts forward the corresponding development strategy of big data security. The results show that the ...

  18. Security and Privacy in Big Data Environment

    This paper discusses scopes, challenges, and various techniques employed to secure privacy and security of Big Data. There is ample scope for further research in Big Data security and privacy of which few areas are discussed under the heading "Future Research Directions" of this write-up. However, in light of existing bottlenecks in data ...

  19. (PDF) Security and Privacy Issues of Big Data

    10. Supervising the Security Level in Real-Time. These security and privacy challenges cov er the entire spectrum of the Big D ata lifecycle (Figure 2): sources of data production (devices), the ...

  20. Differentially Private Synthetic Data with Private Density Estimation

    The need to analyze sensitive data, such as medical records or financial data, has created a critical research challenge in recent years. In this paper, we adopt the framework of differential privacy, and explore mechanisms for generating an entire dataset which accurately captures characteristics of the original data. We build upon the work of Boedihardjo et al, which laid the foundations for ...

  21. [2405.04760] Large Language Models for Cyber Security: A Systematic

    The rapid advancement of Large Language Models (LLMs) has opened up new opportunities for leveraging artificial intelligence in various domains, including cybersecurity. As the volume and sophistication of cyber threats continue to grow, there is an increasing need for intelligent systems that can automatically detect vulnerabilities, analyze malware, and respond to attacks. In this survey, we ...

  22. United States International Cyberspace & Digital Policy Strategy

    Central to our strategy is the effort to build digital solidarity - working together to offer mutual assistance to the victims of malicious cyber activity and other digital harms; assist partners - especially emerging economies - in deploying safe, secure, resilient, and sustainable technologies to advance their development goals; and builds strong and inclusive innovation economies that ...

  23. Cyber risk and cybersecurity: a systematic review of data availability

    Cybercrime is estimated to have cost the global economy just under USD 1 trillion in 2020, indicating an increase of more than 50% since 2018. With the average cyber insurance claim rising from USD 145,000 in 2019 to USD 359,000 in 2020, there is a growing necessity for better cyber information sources, standardised databases, mandatory reporting and public awareness. This research analyses ...

  24. Big Data: Latest Articles, News & Trends

    Big Data Big Data Tableau Review: Features, Pricing, Pros and Cons . Tableau has three pricing tiers that cater to all kinds of data teams, with capabilities like accelerators and real-time analytics.

  25. Research on the Big Data Security Application Based on Artificial

    The importance of data security and privacy is increasing under the trend of digital development and has become an important work research direction for government and enterprises, operators carry a large amount of data of users in the industry. This paper describes potential data security threats in the big data environment, especially operators, and proposes a security protection scheme ...

  26. Innovative data integration in 2024: Pioneering the future of data

    According to a recent forecast by Grand View Research, the global serverless computing market is expected to reach a staggering $21.4 billion by 2025. Serverless data integration solutions ...