MPDA: a data augmentation approach to improve deep learning for software vulnerability detection

Abstract

In the past few years, advances in deep learning-based AI technologies have accelerated research on automatic software vulnerability detection. However, deep learning models trained on vulnerability data learn poorly and show high rates of false negatives (FN) and false positives (FP), mainly because vulnerability datasets are small and imbalanced. We therefore propose a multiperspective data augmentation approach, called MPDA, and apply it to enhance data quality and thereby improve deep learning for software vulnerability detection. MPDA automatically augments software vulnerability data from different perspectives through three augmenting components: augmenting by oversampling, augmenting by GAN, and augmenting by fuzzy sampling. We also design three algorithms, the Oversampling Strategy Selection (OSS) algorithm, the GAN-based data generating algorithm, and the Fuzzy Sampling Strategy Selection (FSS) algorithm, to help MPDA automatically achieve the optimal augmentation effect. We evaluated MPDA on the Juliet Java Suite dataset by applying it to five widely used deep learning models for vulnerability detection, covering 29 types of vulnerability. The results demonstrated that our approach consistently improved the F1 score of each model by at least 12%, except for the Transformer, for which the improvement was 3.9%.
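
To make the class-imbalance problem and the F1 metric concrete, the following is a minimal sketch of plain random oversampling followed by an F1 computation. It is not MPDA's OSS algorithm, which is not described in this abstract. The function names `random_oversample` and `f1_score` and the toy data are assumptions introduced here for illustration only.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes are equal in size.

    Illustrative baseline only; it is not the paper's OSS algorithm.
    """
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    # Every class is grown to the size of the largest class.
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data: 2 vulnerable (label 1) vs 6 non-vulnerable (label 0) samples.
samples = ["v1", "v2", "s1", "s2", "s3", "s4", "s5", "s6"]
labels = [1, 1, 0, 0, 0, 0, 0, 0]
balanced_samples, balanced_labels = random_oversample(samples, labels)
class_counts = Counter(balanced_labels)
```

Because F1 is computed from FP and FN as well as TP, reducing either kind of error raises the score, which is why it suits imbalanced vulnerability data better than plain accuracy.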

Data Availability Statements

The authors confirm that the data supporting the findings of this study are available in the paper. Raw data supporting the findings of this study are available from the corresponding author upon reasonable request. To facilitate further research, we have released the code and data from this work, along with the annotated data, at https://github.com/Yfeix/MPDA.

Notes

https://github.com/find-sec-bugs/juliet-test-suite
https://samate.nist.gov/SARD/test-suites

Funding

This research was supported by the China National Natural Science Foundation (62176164), the Guangdong Province Natural Science Foundation (Grant 2023A1515010992) and the Shenzhen Science and Technology Foundation (JCYJ20220531101217039 and JCYJ20210324093212034).

Author information

Authors and Affiliations

College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China

Feiqiao Mao, Yingxiang Yuan, Xingyang Du, Li Gao & Zhihua Du

Contributions

Feiqiao Mao made substantial contributions to the conception and design of the work and critically reviewed the draft and manuscript for important intellectual content. Yingxiang Yuan and Xingyang Du carried out the acquisition, analysis, and interpretation of data and created the new software used in the work, and Yingxiang Yuan drafted the work. Feiqiao Mao and Yingxiang Yuan wrote the main manuscript text, and Xingyang Du prepared the tables, figures, and LaTeX source files of the manuscript. Li Gao did part of the investigation and visualization work. Zhihua Du managed the project and provided resources, funding, and language editing. All authors reviewed and revised the manuscript.

Corresponding author

Correspondence to Feiqiao Mao.

Ethics declarations

Conflict of Interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical Approval

Not applicable.

Informed Consent

All datasets used in this study are publicly available from the Open Web Application Security Project (OWASP) community repository on GitHub. In addition, the software and libraries used in this research are free and open source. No human or animal subjects were involved in this study.

Clinical trial number

Not applicable.

Additional information

Communicated by: Tegawendé F. Bissyandé.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Mao, F., Yuan, Y., Du, X. et al. MPDA: a data augmentation approach to improve deep learning for software vulnerability detection. Empir Software Eng 30, 140 (2025). https://doi.org/10.1007/s10664-025-10698-y

Accepted 01 July 2025
Published 14 July 2025
DOI https://doi.org/10.1007/s10664-025-10698-y

Keywords

Deep learning
Data augmentation
Software vulnerability detection
Data imbalance
