Shanto Rahman
The University of Texas at Austin
shanto.rahman@utexas.edu
Bio
Shanto Rahman is a fifth-year Ph.D. candidate in Electrical and Computer Engineering at The University of Texas at Austin, advised by Prof. August Shi. Her research tackles reliable testing: predicting and explaining failure modes from code, making rare failures reproducible on demand, and automatically repairing tests when nondeterminism or code changes cause breakage. Shanto’s work spans program analysis and machine learning for code, and has appeared in top venues including ICSE, OOPSLA, ASE, and ICST.
She has completed research internships at Google and Amazon Web Services. She previously worked as a Senior Software Engineer at Samsung Research Bangladesh and later joined the Bangladesh University of Professionals as a faculty member, where she taught software engineering courses. Beyond research, she enjoys mentoring, open-sourcing artifacts, and collaborating with industry to turn ideas into practical tools that help engineers ship faster and safer.
Areas of Research
- Computer Systems
Understanding and Improving Flaky Test Classification
Regression testing is important but suffers from flaky tests, i.e., tests that pass or fail nondeterministically on the same code, which waste developer time, inflate CI cost, and hide real bugs. Prior work reports that fine-tuned large language models (LLMs) can classify flaky tests with high accuracy, but we show these claims are overstated due to flawed evaluation and unrealistic datasets.
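To make the failure mode concrete, the snippet below is a minimal, hypothetical example of an async-wait flaky test written with pytest conventions; it is illustrative only and not taken from any dataset discussed here.

```python
import random
import threading
import time


def fetch_result(out):
    # Simulated asynchronous work whose latency varies from run to run.
    time.sleep(random.uniform(0.01, 0.08))
    out.append(42)


def test_async_result_is_ready():
    out = []
    threading.Thread(target=fetch_result, args=(out,)).start()
    time.sleep(0.05)  # Fixed wait that is only sometimes long enough.
    # The verdict depends on thread scheduling: a classic async-wait flaky test.
    assert out == [42]
```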
We first identify two issues that inflate reported performance: (1) optimistic experimental design (e.g., cross-contamination between train and test sets via project overlap) and (2) misrepresentation of the real-world class distribution of flaky vs. non-flaky tests. After enforcing stricter cross-project, time-aware splits and curating a realistic dataset, FlakeBench, the prior state-of-the-art model’s F1 drops from 81.82% to 56.62%.
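The exact split and curation procedure is defined in the paper; the sketch below only illustrates the general idea of a cross-project, time-aware split, under assumed record fields (project, commit_date, code, label).

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Set, Tuple


@dataclass
class TestRecord:
    project: str          # repository the test comes from
    commit_date: datetime
    code: str             # test method body
    label: str            # flaky category or "not-flaky"


def cross_project_time_aware_split(
    records: List[TestRecord], held_out_projects: Set[str], cutoff: datetime
) -> Tuple[List[TestRecord], List[TestRecord]]:
    # Train only on non-held-out projects committed before the cutoff;
    # evaluate only on held-out projects at or after the cutoff. This rules
    # out project overlap and "training on the future" relative to the test set.
    train = [r for r in records
             if r.project not in held_out_projects and r.commit_date < cutoff]
    test = [r for r in records
            if r.project in held_out_projects and r.commit_date >= cutoff]
    return train, test
```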
Motivated by this gap, we introduce FlakyLens, a training strategy for fine-tuning flaky-test classifiers that lifts F1 to 65.79% (+9.17pp over the corrected baseline). We also benchmark recent pre-trained LLMs (e.g., CodeLlama, DeepSeek-Coder) on the same task; FlakyLens consistently outperforms them, indicating that general-purpose LLMs still lag on this specialized classification problem.
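The FlakyLens recipe itself is not reproduced here; as a rough sketch of the task setup only, the code below fine-tunes a generic code encoder (microsoft/codebert-base, an assumed placeholder backbone) for multi-class flaky-category classification with Hugging Face Transformers. The label names and hyperparameters are illustrative assumptions, not the paper's configuration.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder label set; the real category taxonomy comes from the benchmark.
LABELS = ["async-wait", "concurrency", "order-dependent", "time", "not-flaky"]

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=len(LABELS))


def encode(batch):
    # Tokenize raw test code; truncation handles long test bodies.
    return tokenizer(batch["code"], truncation=True, padding="max_length",
                     max_length=512)


def fine_tune(train_dataset, eval_dataset):
    # Both datasets are assumed to be `datasets.Dataset` objects with "code"
    # and integer "label" columns, already split cross-project and time-aware.
    args = TrainingArguments(output_dir="flaky-clf", num_train_epochs=3,
                             per_device_train_batch_size=16)
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_dataset.map(encode, batched=True),
                      eval_dataset=eval_dataset.map(encode, batched=True))
    trainer.train()
    return trainer
```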
To probe model behavior, we compute per-token attribution on test code and analyze which tokens most influence predictions across flaky categories. Using these influential tokens, we perform adversarial perturbations and observe accuracy shifts of up to 18.37pp, revealing sensitivity to category-specific surface cues rather than robust semantic understanding.
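The paper's attribution and perturbation pipeline is not spelled out in this abstract; the sketch below shows one common attribution choice, gradient-times-input saliency over input embeddings, reusing the model and tokenizer from the previous sketch. Top-ranked tokens would then be perturbed (e.g., by renaming sleep-related identifiers) before re-measuring accuracy.

```python
import torch


def token_saliency(model, tokenizer, code: str, target_label: int):
    # Gradient-times-input saliency per token; a common attribution method,
    # not necessarily the one used in the study.
    enc = tokenizer(code, return_tensors="pt", truncation=True)
    embeds = model.get_input_embeddings()(enc["input_ids"])
    embeds = embeds.detach().requires_grad_(True)
    logits = model(inputs_embeds=embeds,
                   attention_mask=enc["attention_mask"]).logits
    logits[0, target_label].backward()
    # Aggregate |grad * input| over the hidden dimension to score each token.
    scores = (embeds.grad * embeds).abs().sum(dim=-1).squeeze(0).detach()
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    # Return tokens ranked from most to least influential for the target label.
    return sorted(zip(tokens, scores.tolist()), key=lambda pair: -pair[1])
```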
Overall, our results recalibrate expectations for LLM-based flaky-test classification, provide a reproducible benchmark and protocol, and highlight the need for more robust training and evaluation to ensure reliability in CI pipelines.