Title: Enabling Consistent Data Selection with Representation Shifts
Time: Friday, April 21st, 3:00 PM
Location: CSIP library (room 5126), 5th floor, Centergy One building
Bio: Ryan Benkert is a fourth-year Ph.D. student in the Omni Lab for Intelligent Visual Engineering and Science (OLIVES) at the Georgia Institute of Technology. His research addresses fundamental challenges in machine learning that bridge the gap between academic research and industrial deployment. His interests include active learning, uncertainty estimation, and neural network learning dynamics. Prior to Georgia Tech, he received his B.Sc. and M.Sc. from RWTH Aachen University in Germany.
Abstract: Regression describes the performance deterioration after a model update. For modern data acquisition pipelines, performance regression is a major concern as models are updated iteratively with newly acquired data. However, the current standard in several data selection paradigms assumes a direct relationship between model generalization and performance regression, namely that performance regression decreases as more training data becomes available. In this talk, I will discuss different sources of regression and demonstrate empirically that additional data can increase or decrease performance regression independently of the generalization behavior. In particular, I will explore dataset imbalance and class complexity as two influential factors in performance regression. Further, I will consider optimal settings where the selection algorithm has prior knowledge of the regression properties within the dataset. Based on these observations, I will derive an approximate upper bound on performance regression, giving rise to a plug-in algorithm for regression reduction in data acquisition pipelines. The talk concludes with several data acquisition experiments on in-distribution as well as out-of-distribution data, demonstrating a clear reduction in performance regression with the proposed algorithm.