Abstract
The adoption of Machine Learning (ML)-enabled systems is growing rapidly, introducing novel challenges in maintaining quality and managing technical debt in these complex systems. Among the key quality threats are ML-specific code smells (ML-CSs), suboptimal implementation practices in ML pipelines that can compromise system performance, reliability, and maintainability. Although these smells have been defined in the literature, detailed insights into their characteristics, evolution, and mitigation strategies are still needed to help developers address these quality issues effectively. In this paper, we investigate the emergence and evolution of ML-CSs through a large-scale empirical study focusing on (i) their prevalence in real ML-enabled systems, (ii) how they are introduced and removed, and (iii) their survivability. We analyze over 400,000 commits from 337 ML-enabled projects, leveraging CodeSmile, a novel ML smell detector that we developed to enable our investigation and identify ML-specific code smells. Our results reveal that: (1) CodeSmile can detect ML-CSs with precision and recall rates of 87.4% and 78.6%, respectively; (2) ML-CSs are frequently introduced during file modifications in new feature tasks; (3) smells are typically removed during tasks related to new features, enhancements, or refactoring; and (4) the majority of ML-CSs are resolved within the first 10% of commits. Based on these findings, we provide actionable conclusions and insights to guide future research and quality assurance practices for ML-enabled systems.
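To make the notion of an ML-specific code smell concrete, the sketch below flags one smell named in this article's notes, 'Columns and DataType Not Explicitly Set'. It is a minimal illustration under stated assumptions, not CodeSmile's actual detection logic: the rule (a `read_csv` call with no `dtype` keyword), the function name `find_unset_dtype_calls`, and the sample snippet are all hypothetical. It uses only Python's standard `ast` module and parses the sample without executing it, so pandas does not need to be installed.

```python
import ast
from typing import List

# Hypothetical rule: flag pandas-style read_csv calls that omit `dtype`.
# This is an illustrative approximation, not CodeSmile's real detection rule.
READERS = {"read_csv"}


def find_unset_dtype_calls(source: str) -> List[int]:
    """Return line numbers of read_csv(...) calls that pass no `dtype` keyword."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match attribute calls such as pd.read_csv(...) or pandas.read_csv(...).
        if isinstance(func, ast.Attribute) and func.attr in READERS:
            keywords = {kw.arg for kw in node.keywords}
            if "dtype" not in keywords:
                flagged.append(node.lineno)
    return sorted(flagged)


# Hypothetical input: line 2 omits dtype, line 3 sets columns and dtype explicitly.
SAMPLE = '''import pandas as pd
df_smelly = pd.read_csv("data.csv")
df_clean = pd.read_csv("data.csv", usecols=["id", "score"], dtype={"id": "int64", "score": "float64"})
'''

print(find_unset_dtype_calls(SAMPLE))
```

A production detector would also need to resolve import aliases and confirm that the receiver is actually pandas; this sketch matches on the method name alone.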
Data Availability Statement
The manuscript includes data as electronic supplementary material. In particular, the datasets generated and analyzed during the current study, detailed results, scripts, and additional resources useful for reproducing the study are available in our online appendix on Figshare (Recupito et al. 2024). The GitHub repository for CodeSmile is available at: https://github.com/giammariagiordano/smell_ai/tree/main.
Notes
- https://chat.openai.com/
- Source code available at: https://github.com/huggingface/transformers/blob/main/examples/research_projects/bertology/run_bertology.py
- Available at https://github.com/acmsigsoft/EmpiricalStandards
- Qualtrics sample size: https://www.qualtrics.com/blog/calculating-sample-size/
- Case of a ‘Columns and DataType Not Explicitly Set’ smell-introducing commit during refactoring in the BrikerMan/Kashgari project: https://github.com/BrikerMan/Kashgari/commit/f7fb43d2f3651fbba92eb6e5cee8bfd279b0317a
- Case of a ‘Columns and DataType Not Explicitly Set’ smell-removing commit during an enhancement in the RTIInternational/gobbli project: https://github.com/RTIInternational/gobbli/commit/b93d184c610c3ae779607679501b4b1dafd30b28
- Example of a smell-removing commit in which the developer was aware of the ML-CS: https://github.com/geyang/ml-logger/commit/25aff14cf101ac4db06be02e65d845c54fc38c84
References
- Azeem MI, Palomba F, Shi L, Wang Q (2019) Machine learning techniques for code smell detection: A systematic literature review and meta-analysis. Inf Softw Technol 108:115–138
- Basili VR, Caldiera G, Rombach HD (1994) The goal question metric approach. Encyclopedia of software engineering pp 528–532
- Bessghaier N, Ouni A, Mkaouer MW (2020) On the diffusion and impact of code smells in web applications. In: Wang Q, Xia Y, Seshadri S, Zhang LJ (eds) Services Computing – SCC 2020. Springer International Publishing, Cham, pp 67–84
- Cardozo N, Dusparic I, Cabrera C (2023) Prevalence of code smells in reinforcement learning projects
- Cliff N (1993) Dominance statistics: Ordinal analyses to answer ordinal questions. Psychol Bull 114:494–509. https://api.semanticscholar.org/CorpusID:120113824
- Conover WJ (1999) Practical nonparametric statistics, vol 350. John Wiley & Sons
- Costal D, Gómez C, Martínez-Fernández S (2024) Metrics for code smells of ML pipelines. In: Kadgien R, Jedlitschka A, Janes A, Lenarduzzi V, Li X (eds) Product-focused software process improvement. Springer Nature Switzerland, Cham, pp 3–9
- Cunningham W (1992) The WyCash portfolio management system. ACM Sigplan Oops Messenger 4(2):29–30
- Fowler M (2018) Refactoring. Addison-Wesley Professional
- Fowler M, Beck K (1997) Refactoring: Improving the design of existing code. In: 11th European conference. Jyväskylä, Finland
- Giordano G, Annunziata G, De Lucia A, Palomba F et al (2023) Understanding developer practices and code smells diffusion in AI-enabled software: A preliminary study. In: IWSM-Mensura
- Giordano G, Fasulo A, Catolino G, Palomba F, Ferrucci F, Gravino C (2022) On the evolution of inheritance and delegation mechanisms and their impact on code quality. In: 2022 IEEE International conference on software analysis, evolution and reengineering (SANER), pp 947–958. IEEE
- Giordano G, Sellitto G, Sepe A, Palomba F, Ferrucci F (2023) The yin and yang of software quality: On the relationship between design patterns and code smells. In: 2023 49th Euromicro conference on software engineering and advanced applications (SEAA), pp 227–234. IEEE
- Khomh F, Penta MD, Guéhéneuc YG, Antoniol G (2012) An exploratory study of the impact of antipatterns on class change-and fault-proneness. Empir Softw Eng 17:243–275
- Lehman MM, Ramil JF, Wernick PD, Perry DE, Turski WM (1997) Metrics and laws of software evolution-the nineties view. In: Proceedings fourth international software metrics symposium, pp 20–32. IEEE
- Lenarduzzi V, Lomio F, Moreschini S, Taibi D, Tamburri DA (2021) Software quality for AI: Where we are now? In: Software quality: future perspectives on software engineering quality: 13th international conference, SWQD 2021, Vienna, Austria, January 19–21, 2021, Proceedings 13, pp 43–53. Springer
- Martínez-Fernández S, Bogner J, Franch X, Oriol M, Siebert J, Trendowicz A, Vollmer AM, Wagner S (2022) Software engineering for AI-based systems: a survey. ACM Trans Softw Eng Methodol (TOSEM) 31(2):1–59
- Murphy-Hill E, Black AP (2007) Why don’t people use refactoring tools? In: Proceedings of the 1st workshop on refactoring tools, pp 61–62
- Palomba F, Bavota G, Di Penta M, Fasano F, Oliveto R, De Lucia A (2018) On the diffuseness and the impact on maintainability of code smells: a large scale empirical investigation. In: Proceedings of the 40th international conference on software engineering, pp 482–482
- Palomba F, Bavota G, Di Penta M, Oliveto R, De Lucia A (2014) Do they really smell bad? a study on developers’ perception of bad code smells. In: 2014 IEEE International conference on software maintenance and evolution, pp 101–110. IEEE
- Palomba F, Panichella A, Zaidman A, Oliveto R, De Lucia A (2018) The scent of a smell: An extensive comparison between textual and structural smells. IEEE Trans Softw Eng 44(10):977–1000. https://doi.org/10.1109/TSE.2017.2752171
- de Paulo Sobrinho EV, De Lucia A, de Almeida Maia M (2018) A systematic literature review on bad smells-5 W’s: which, when, what, who, where. IEEE Trans Softw Eng 47(1):17–66
- Ratzinger J, Sigmund T, Gall HC (2008) On the relation of refactorings and software defect prediction. In: Proceedings of the 2008 international working conference on Mining software repositories, pp 35–38
- Recupito G, Giordano G, Ferrucci F, Di Nucci D, Palomba F (2024) When code smells meet ML: On the lifecycle of ML-specific code smells in ML-enabled systems – appendix. https://doi.org/10.6084/m9.figshare.28167065
- Recupito G, Pecorelli F, Catolino G, Lenarduzzi V, Taibi D, Di Nucci D, Palomba F (2024) Technical debt in AI-enabled systems: On the prevalence, severity, impact, and management strategies for code and architecture. J Syst Softw 216:112151
- Rhmann W (2021) Quantitative software change prediction in open source web projects using time series forecasting. Int J Open Source Softw Processes (IJOSSP) 12(2):36–51
- Riquet N, Devroey X, Vanderose B (2022) GitDelver enterprise dataset (GDED): an industrial closed-source dataset for socio-technical research. In: Proceedings of the 19th international conference on mining software repositories, pp 403–407
- Rosner B, Glynn RJ (2009) Power and sample size estimation for the Wilcoxon rank sum test with application to comparisons of c statistics from alternative prediction models. Biometrics 65(1):188–197. https://doi.org/10.1111/j.1541-0420.2008.01062.x
- Sculley D, Holt G, Golovin D, Davydov E, Phillips T, Ebner D, Chaudhary V, Young M, Crespo JF, Dennison D (2015) Hidden technical debt in machine learning systems. Advances in neural information processing systems 28
- Spadini D, Aniche M, Bacchelli A (2018) PyDriller: Python framework for mining software repositories. In: Proceedings of the 2018 26th ACM Joint meeting on European software engineering conference and symposium on the foundations of software engineering, pp 908–911
- Taibi D, Janes A, Lenarduzzi V (2017) How developers perceive smells in source code: A replicated study. Inf Softw Technol 92:223–235
- Tang Y, Khatchadourian R, Bagherzadeh M, Singh R, Stewart A, Raja A (2021) An empirical study of refactorings and technical debt in machine learning systems. In: 2021 IEEE/ACM 43rd international conference on software engineering (ICSE), pp 238–250. IEEE
- Tufano M, Palomba F, Bavota G, Oliveto R, Di Penta M, De Lucia A, Poshyvanyk D (2017) When and why your code starts to smell bad (and whether the smells go away). IEEE Trans Softw Eng 43(11):1063–1088
- Van Oort B, Cruz L, Aniche M, Van Deursen A (2021) The prevalence of code smells in machine learning projects. In: 2021 IEEE/ACM 1st Workshop on AI engineering-software engineering for AI (WAIN), pp 1–8. IEEE
- Walter B, Alkhaeir T (2016) The relationship between design patterns and code smells: An exploratory study. Inf Softw Technol 74:127–142
- Wang G, Wang Z, Chen J, Chen X, Yan M (2022) An empirical study on numerical bugs in deep learning programs. In: Proceedings of the 37th IEEE/ACM international conference on automated software engineering, pp 1–5
- Widyasari R, Yang Z, Thung F, Qin Sim S, Wee F, Lok C, Phan J, Qi H, Tan C, Tay Q, Lo D (2023) Niche: A curated dataset of engineered machine learning projects in Python. In: 2023 IEEE/ACM 20th International conference on mining software repositories (MSR), pp 62–66
- Wohlin C, Runeson P, Höst M, Ohlsson MC, Regnell B, Wesslén A (2012) Experimentation in software engineering. Springer Science & Business Media
- Zhang H, Cruz L, Van Deursen A (2022) Code smells for machine learning applications. In: Proceedings of the 1st international conference on AI engineering: software engineering for AI, pp 217–228
- Zhou Y, Leung H, Xu B (2009) Examining the potentially confounding effect of class size on the associations between object-oriented metrics and change-proneness. IEEE Trans Softw Eng 35(5):607–623
Acknowledgements
This work has been partially supported by the European Union – NextGenerationEU through the Italian Ministry of University and Research, Projects PRIN 2022 “QualAI: Continuous Quality Improvement of AI-based Systems” (grant n. 2022B3BP5S, CUP: H53D23003510006) and by the EMELIOT national project funded by the MUR under the PRIN 2020 program (Contract 2020W3A5FY).
Ethics declarations
Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Communicated by: Gema Rodriguez-Perez and Ben Hermann.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article belongs to the Topical Collection: SI: Registered Reports
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Recupito, G., Giordano, G., Ferrucci, F. et al. When code smells meet ML: on the lifecycle of ML-specific code smells in ML-enabled systems. Empir Software Eng 30, 139 (2025). https://doi.org/10.1007/s10664-025-10676-4
- Accepted
- Published
- DOI https://doi.org/10.1007/s10664-025-10676-4
Keywords
- Software engineering for artificial intelligence
- Software quality for artificial intelligence
- Technical debt
- Empirical software engineering