Article Content

Abstract

In the past few years, advances in deep learning have accelerated research on automatic software vulnerability detection. However, deep learning models trained on vulnerability data have limited learning ability and suffer from high rates of false negatives (FN) and false positives (FP), mainly because vulnerability datasets are small and imbalanced. We therefore propose a multiperspective data augmentation approach, called MPDA, which enhances data quality to improve deep learning for software vulnerability detection. MPDA automatically augments software vulnerability data from different perspectives through three augmenting components: augmenting by oversampling, augmenting by GAN, and augmenting by fuzzy sampling. We also design three algorithms, the Oversampling Strategy Selection (OSS) algorithm, the GAN-based data generating algorithm, and the Fuzzy Sampling Strategy Selection (FSS) algorithm, to help MPDA automatically achieve the optimal augmentation effect. We evaluated MPDA on the Juliet Java Suite dataset by applying it to five models widely used in deep learning-based vulnerability detection, covering 29 types of vulnerability. The results show that our approach consistently improved the F1 score of each model by at least 12%, except for Transformer, for which the improvement was 3.9%.
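To make the imbalance problem and the F1 metric concrete, the following is a minimal sketch, not the paper's OSS algorithm or any part of the released MPDA code. It assumes samples are already encoded as numeric feature vectors and uses a simplified SMOTE-style interpolation (Chawla et al. 2002) as one possible oversampling strategy. The function names `f1_score` and `smote_like_oversample`, the toy data, and the parameters `k` and `seed` are all hypothetical choices made for illustration.

```python
import random
from typing import List, Sequence

Vector = List[float]


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive (vulnerable) class, computed from TP, FP and FN."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_like_oversample(
    minority: List[Vector], n_new: int, k: int = 3, seed: int = 0
) -> List[Vector]:
    """Create n_new synthetic minority samples.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours, which is
    the core idea of SMOTE (simplified here for illustration).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic: List[Vector] = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: squared_distance(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy, made-up data: 3 vulnerable samples versus 9 benign samples.
vulnerable = [[0.9, 0.8], [1.0, 1.1], [0.8, 1.0]]
benign = [[0.1 * i, 0.05 * i] for i in range(9)]

# Generate enough synthetic vulnerable samples to balance the two classes.
synthetic = smote_like_oversample(vulnerable, n_new=len(benign) - len(vulnerable))
augmented = vulnerable + synthetic
print(len(augmented), len(benign))
```

In a real pipeline the augmented minority set would be merged with the benign samples before training, and F1 would be measured on a held-out test split that contains no synthetic samples.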

Data Availability Statements

The authors confirm that the data supporting the findings of this study are available in the paper. Raw data supporting the findings of this study are available from the corresponding author upon reasonable request. To facilitate further research, we have released the code and data from this work, along with the annotated data, at https://github.com/Yfeix/MPDA.

Notes

  1. https://github.com/find-sec-bugs/juliet-test-suite

  2. https://samate.nist.gov/SARD/test-suites

Funding

This research was supported by the China National Natural Science Foundation (62176164), the Guangdong Province Natural Science Foundation (Grant 2023A1515010992) and the Shenzhen Science and Technology Foundation (JCYJ20220531101217039 and JCYJ20210324093212034).

Author information

Authors and Affiliations

Contributions

Feiqiao Mao made substantial contributions to the conception and design of the work and critically reviewed the draft and manuscript for important intellectual content. Yingxiang Yuan and Xingyang Du carried out the acquisition, analysis, and interpretation of data and created the new software used in the work, and Yingxiang Yuan drafted the work. Feiqiao Mao and Yingxiang Yuan wrote the main manuscript text, and Xingyang Du prepared the tables, figures, and LaTeX source files of the manuscript. Li Gao did part of the investigation and visualization work. Zhihua Du managed the project and provided resources, funding, and language editing. All authors reviewed and revised the manuscript.

Corresponding author

Correspondence to Feiqiao Mao.

Ethics declarations

Conflict of Interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical Approval

Not applicable.

Informed Consent

All datasets used in this study are publicly available from the Open Web Application Security Project (OWASP) community repository on GitHub. In addition, the software and libraries used in this research are free and open source. No human or animal subjects were involved in this study.

Clinical trial number

Not applicable.

Additional information

Communicated by: Tegawendé F. Bissyandé.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Mao, F., Yuan, Y., Du, X. et al. MPDA: a data augmentation approach to improve deep learning for software vulnerability detection. Empir Software Eng 30, 140 (2025). https://doi.org/10.1007/s10664-025-10698-y

  • Accepted 
  • Published 
  • DOI  https://doi.org/10.1007/s10664-025-10698-y

Keywords

  • Deep learning
  • Data augmentation
  • Software vulnerability detection
  • Data imbalance