
Abstract

The adoption of Machine Learning (ML)-enabled systems is growing rapidly, introducing novel challenges in maintaining quality and managing technical debt in these complex systems. Among the key quality threats are ML-specific code smells (ML-CSs), suboptimal implementation practices in ML pipelines that can compromise system performance, reliability, and maintainability. Although these smells have been defined in the literature, detailed insights into their characteristics, evolution, and mitigation strategies are still needed to help developers address these quality issues effectively. In this paper, we investigate the emergence and evolution of ML-CSs through a large-scale empirical study focusing on (i) their prevalence in real ML-enabled systems, (ii) how they are introduced and removed, and (iii) their survivability. We analyze over 400,000 commits from 337 ML-enabled projects, leveraging CodeSmile, a novel ML smell detector that we developed to enable our investigation and identify ML-specific code smells. Our results reveal that: (1) CodeSmile can detect ML-CSs with precision and recall rates of 87.4% and 78.6%, respectively; (2) ML-CSs are frequently introduced during file modifications in new feature tasks; (3) smells are typically removed during tasks related to new features, enhancements, or refactoring; and (4) the majority of ML-CSs are resolved within the first 10% of commits. Based on these findings, we provide actionable conclusions and insights to guide future research and quality assurance practices for ML-enabled systems.
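For readers who want to check how the reported detection rates of CodeSmile relate to a confusion matrix, the standard definitions of precision and recall can be sketched in Python. The counts below are hypothetical, chosen only so that the formulas reproduce the reported 87.4% and 78.6% figures; they are not taken from the study's data.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported smells that are actual smells: TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of actual smells that are reported: TP / (TP + FN)."""
    return tp / (tp + fn)

# Hypothetical counts, picked only to match the reported rates:
# 874 true positives, 126 false positives, 238 false negatives.
print(round(precision(874, 126), 3))  # 0.874
print(round(recall(874, 238), 3))     # 0.786
```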


Data Availability Statement

The manuscript includes data as electronic supplementary material. In particular, datasets generated and analyzed during the current study, detailed results, scripts, and additional resources useful for reproducing the study are available as part of our online appendix on Figshare (Recupito et al. 2024). In addition, we included the GitHub repository link for CodeSmile: https://github.com/giammariagiordano/smell_ai/tree/main.

Notes

  1. https://chat.openai.com/

  2. Source code available at: https://github.com/huggingface/transformers/blob/main/examples/research_projects/bertology/run_bertology.py

  3. Available at https://github.com/acmsigsoft/EmpiricalStandards

  4. Qualtrics sample size: https://www.qualtrics.com/blog/calculating-sample-size/

  5. Case of a ‘Columns and DataType Not Explicitly Set’ smell-introducing commit when performing refactoring in BrikerMan/Kashgari project: https://github.com/BrikerMan/Kashgari/commit/f7fb43d2f3651fbba92eb6e5cee8bfd279b0317a

  6. Case of a ‘Columns and DataType Not Explicitly Set’ smell-removing commit when performing enhancement in RTIInternational/gobbli project: https://github.com/RTIInternational/gobbli/commit/b93d184c610c3ae779607679501b4b1dafd30b28

  7. Example of a smell-removing commit aware of the ML-CSs: https://github.com/geyang/ml-logger/commit/25aff14cf101ac4db06be02e65d845c54fc38c84

References

  • Azeem MI, Palomba F, Shi L, Wang Q (2019) Machine learning techniques for code smell detection: A systematic literature review and meta-analysis. Inf Softw Technol 108:115–138


  • Basili VR, Caldiera G, Rombach HD (1994) The goal question metric approach. Encyclopedia of software engineering pp 528–532

  • Bessghaier N, Ouni A, Mkaouer MW (2020) On the diffusion and impact of code smells in web applications. In: Wang Q, Xia Y, Seshadri S, Zhang LJ (eds) Services Computing – SCC 2020. Springer International Publishing, Cham, pp 67–84


  • Cardozo N, Dusparic I, Cabrera C (2023) Prevalence of code smells in reinforcement learning projects

  • Cliff N (1993) Dominance statistics: Ordinal analyses to answer ordinal questions. Psychol Bull 114:494–509. https://api.semanticscholar.org/CorpusID:120113824

  • Conover WJ (1999) Practical nonparametric statistics, vol 350. John Wiley & Sons

  • Costal D, Gómez C, Martínez-Fernández S (2024) Metrics for code smells of ml pipelines. In: Kadgien R, Jedlitschka A, Janes A, Lenarduzzi V, Li X (eds) Product-focused software process improvement. Springer Nature Switzerland, Cham, pp 3–9


  • Cunningham W (1992) The wycash portfolio management system. ACM Sigplan Oops Messenger 4(2):29–30


  • Fowler M (2018) Refactoring. Addison-Wesley Professional


  • Fowler M, Beck K (1997) Refactoring: Improving the design of existing code. In: 11th European conference on object-oriented programming (ECOOP). Jyväskylä, Finland

  • Giordano G, Annunziata G, De Lucia A, Palomba F et al (2023) Understanding developer practices and code smells diffusion in ai-enabled software: A preliminary study. In: IWSM-Mensura

  • Giordano G, Fasulo A, Catolino G, Palomba F, Ferrucci F, Gravino C (2022) On the evolution of inheritance and delegation mechanisms and their impact on code quality. In: 2022 IEEE International conference on software analysis, evolution and reengineering (SANER), pp 947–958. IEEE

  • Giordano G, Sellitto G, Sepe A, Palomba F, Ferrucci F (2023) The yin and yang of software quality: On the relationship between design patterns and code smells. In: 2023 49th Euromicro conference on software engineering and advanced applications (SEAA), pp 227–234. IEEE

  • Khomh F, Penta MD, Guéhéneuc YG, Antoniol G (2012) An exploratory study of the impact of antipatterns on class change-and fault-proneness. Empir Softw Eng 17:243–275


  • Lehman MM, Ramil JF, Wernick PD, Perry DE, Turski WM (1997) Metrics and laws of software evolution-the nineties view. In: Proceedings fourth international software metrics symposium, pp 20–32. IEEE

  • Lenarduzzi V, Lomio F, Moreschini S, Taibi D, Tamburri DA (2021) Software quality for ai: Where we are now? In: Software quality: future perspectives on software engineering quality: 13th international conference, SWQD 2021, Vienna, Austria, January 19–21, 2021, Proceedings 13, pp 43–53. Springer

  • Martínez-Fernández S, Bogner J, Franch X, Oriol M, Siebert J, Trendowicz A, Vollmer AM, Wagner S (2022) Software engineering for ai-based systems: a survey. ACM Trans Softw Eng Methodol (TOSEM) 31(2):1–59


  • Murphy-Hill E, Black AP (2007) Why don’t people use refactoring tools? In: Proceedings of the 1st workshop on refactoring tools, pp 61–62

  • Palomba F, Bavota G, Di Penta M, Fasano F, Oliveto R, De Lucia A (2018) On the diffuseness and the impact on maintainability of code smells: a large scale empirical investigation. In: Proceedings of the 40th international conference on software engineering, pp 482–482

  • Palomba F, Bavota G, Di Penta M, Oliveto R, De Lucia A (2014) Do they really smell bad? a study on developers’ perception of bad code smells. In: 2014 IEEE International conference on software maintenance and evolution, pp 101–110. IEEE

  • Palomba F, Panichella A, Zaidman A, Oliveto R, De Lucia A (2018) The scent of a smell: An extensive comparison between textual and structural smells. IEEE Trans Softw Eng 44(10):977–1000. https://doi.org/10.1109/TSE.2017.2752171


  • de Paulo Sobrinho EV, De Lucia A, de Almeida Maia M (2018) A systematic literature review on bad smells-5 w’s: which, when, what, who, where. IEEE Trans Softw Eng 47(1):17–66


  • Ratzinger J, Sigmund T, Gall HC (2008) On the relation of refactorings and software defect prediction. In: Proceedings of the 2008 international working conference on Mining software repositories, pp 35–38

  • Recupito G, Giordano G, Ferrucci F, Di Nucci D, Palomba F (2024) When code smells meet ml: On the lifecycle of ml-specific code smells in ml-enabled systems – appendix. https://doi.org/10.6084/m9.figshare.28167065


  • Recupito G, Pecorelli F, Catolino G, Lenarduzzi V, Taibi D, Di Nucci D, Palomba F (2024) Technical debt in ai-enabled systems: On the prevalence, severity, impact, and management strategies for code and architecture. J Syst Softw 216:112151


  • Rhmann W (2021) Quantitative software change prediction in open source web projects using time series forecasting. Int J Open Source Softw Processes (IJOSSP) 12(2):36–51


  • Riquet N, Devroey X, Vanderose B (2022) Gitdelver enterprise dataset (gded) an industrial closed-source dataset for socio-technical research. In: Proceedings of the 19th international conference on mining software repositories, pp 403–407

  • Rosner B, Glynn RJ (2009) Power and sample size estimation for the wilcoxon rank sum test with application to comparisons of c statistics from alternative prediction models. Biometrics 65(1):188–197. https://doi.org/10.1111/j.1541-0420.2008.01062.x


  • Sculley D, Holt G, Golovin D, Davydov E, Phillips T, Ebner D, Chaudhary V, Young M, Crespo JF, Dennison D (2015) Hidden technical debt in machine learning systems. Advances in neural information processing systems 28

  • Spadini D, Aniche M, Bacchelli A (2018) Pydriller: Python framework for mining software repositories. In: Proceedings of the 2018 26th ACM Joint meeting on european software engineering conference and symposium on the foundations of software engineering, pp 908–911

  • Taibi D, Janes A, Lenarduzzi V (2017) How developers perceive smells in source code: A replicated study. Inf Softw Technol 92:223–235


  • Tang Y, Khatchadourian R, Bagherzadeh M, Singh R, Stewart A, Raja A (2021) An empirical study of refactorings and technical debt in machine learning systems. In: 2021 IEEE/ACM 43rd international conference on software engineering (ICSE), pp 238–250. IEEE

  • Tufano M, Palomba F, Bavota G, Oliveto R, Di Penta M, De Lucia A, Poshyvanyk D (2017) When and why your code starts to smell bad (and whether the smells go away). IEEE Trans Softw Eng 43(11):1063–1088


  • Van Oort B, Cruz L, Aniche M, Van Deursen A (2021) The prevalence of code smells in machine learning projects. In: 2021 IEEE/ACM 1st Workshop on AI engineering-software engineering for AI (WAIN), pp 1–8. IEEE

  • Walter B, Alkhaeir T (2016) The relationship between design patterns and code smells: An exploratory study. Inf Softw Technol 74:127–142


  • Wang G, Wang Z, Chen J, Chen X, Yan M (2022) An empirical study on numerical bugs in deep learning programs. In: Proceedings of the 37th IEEE/ACM international conference on automated software engineering, pp 1–5

  • Widyasari R, Yang Z, Thung F, Qin Sim S, Wee F, Lok C, Phan J, Qi H, Tan C, Tay Q, Lo D (2023) Niche: A curated dataset of engineered machine learning projects in python. In: 2023 IEEE/ACM 20th International conference on mining software repositories (MSR), pp 62–66

  • Wohlin C, Runeson P, Höst M, Ohlsson MC, Regnell B, Wesslén A (2012) Experimentation in software engineering. Springer Science & Business Media

  • Zhang H, Cruz L, Van Deursen A (2022) Code smells for machine learning applications. In: Proceedings of the 1st international conference on AI engineering: software engineering for AI, pp 217–228

  • Zhou Y, Leung H, Xu B (2009) Examining the potentially confounding effect of class size on the associations between object-oriented metrics and change-proneness. IEEE Trans Softw Eng 35(5):607–623



Acknowledgements

This work has been partially supported by the European Union – NextGenerationEU through the Italian Ministry of University and Research, PRIN 2022 project “QualAI: Continuous Quality Improvement of AI-based Systems” (grant n. 2022B3BP5S, CUP: H53D23003510006), and by the EMELIOT national project funded by the MUR under the PRIN 2020 program (Contract 2020W3A5FY).

Author information

Authors and Affiliations

Contributions

Gilberto Recupito: Formal analysis, Investigation, Data Curation, Validation, Writing – Original Draft, Visualization. Giammaria Giordano: Formal analysis, Investigation, Data Curation, Validation, Writing – Original Draft, Visualization. Filomena Ferrucci: Writing – Review & Editing. Dario Di Nucci: Supervision, Writing – Review & Editing. Fabio Palomba: Supervision, Resources, Writing – Review & Editing.

Corresponding author

Correspondence to Gilberto Recupito.

Ethics declarations

Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Communicated by: Gema Rodriguez-Perez and Ben Hermann.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: SI: Registered Reports

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Recupito, G., Giordano, G., Ferrucci, F. et al. When code smells meet ML: on the lifecycle of ML-specific code smells in ML-enabled systems. Empir Software Eng 30, 139 (2025). https://doi.org/10.1007/s10664-025-10676-4


Keywords

  • Software engineering for artificial intelligence
  • Software quality for artificial intelligence
  • Technical debt
  • Empirical software engineering