Article Content
Abstract
In the past few years, advances in deep learning-based AI technologies have accelerated research on automatic software vulnerability detection. However, deep learning models trained on vulnerability data have limited learning ability and suffer from high rates of false negatives (FN) and false positives (FP), mainly because vulnerability datasets are small and imbalanced. We therefore propose a multiperspective data augmentation approach, called MPDA, and apply it to enhance data quality and thereby improve deep learning for software vulnerability detection. MPDA automatically augments software vulnerability data from different perspectives through three components: augmenting by oversampling, augmenting by GAN, and augmenting by fuzzy sampling. We also design three algorithms, the Oversampling Strategy Selection (OSS) algorithm, the GAN-based data generating algorithm, and the Fuzzy Sampling Strategy Selection (FSS) algorithm, to help MPDA automatically achieve the optimal augmentation effect. We evaluated MPDA on the Juliet Java Suite dataset by applying it to five models widely used in deep learning-based vulnerability detection, covering 29 types of vulnerability. The results show that our approach consistently improved the performance of each model by at least 12% in terms of the F1 score, except for Transformer, for which the improvement was 3.9%.
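To make two of the abstract's terms concrete, the following minimal Python sketch shows plain random oversampling of a minority class and the computation of the F1 score from false positives and false negatives. It is an illustration only and is not the OSS algorithm or any other part of MPDA; the function names `random_oversample` and `f1_score`, the toy labels, and the choice of simple duplication as the oversampling strategy are all assumptions made for this example.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate samples at random until every class is as large as the
    largest class. Hypothetical helper, not the paper's OSS algorithm."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw (with replacement) as many extra samples as the class is short.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(xs + extra)
        out_y.extend([y] * target)
    return out_x, out_y


def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy imbalanced data: four non-vulnerable samples (0), one vulnerable (1).
    xs, ys = random_oversample(["a", "b", "c", "d", "e"], [0, 0, 0, 0, 1])
    print(Counter(ys))
    print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

Because the F1 score combines precision (lowered by false positives) and recall (lowered by false negatives), it penalizes both kinds of error that the abstract attributes to small and imbalanced training data.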
Data Availability Statements
The authors confirm that the data supporting the findings of this study are available in the paper. Raw data supporting the findings of this study are available from the corresponding author upon reasonable request. To facilitate further research, we have released the code and data from this work, along with the annotated data, at https://github.com/Yfeix/MPDA.
Notes
- https://github.com/find-sec-bugs/juliet-test-suite
- https://samate.nist.gov/SARD/test-suites
Funding
This research was supported by the China National Natural Science Foundation (62176164), the Guangdong Province Natural Science Foundation (Grant 2023A1515010992) and the Shenzhen Science and Technology Foundation (JCYJ20220531101217039 and JCYJ20210324093212034).
Ethics declarations
Conflict of Interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical Approval
Not applicable.
Informed Consent
All datasets used in this study are publicly available from the Open Web Application Security Project (OWASP) community repository on GitHub. In addition, the software and libraries used in this research are free and open source. No human or animal subjects were involved in this study.
Clinical trial number
Not applicable.
Additional information
Communicated by: Tegawendé F. Bissyandé.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Mao, F., Yuan, Y., Du, X. et al. MPDA: a data augmentation approach to improve deep learning for software vulnerability detection. Empir Software Eng 30, 140 (2025). https://doi.org/10.1007/s10664-025-10698-y
- DOI https://doi.org/10.1007/s10664-025-10698-y
Keywords
- Deep learning
- Data augmentation
- Software vulnerability detection
- Data imbalance