Sihan Zeng

Title: Two-Agent Competitive Reinforcement Learning

Time: Friday, April 7th, 3:00 PM
Location: CSIP library (room 5126), 5th floor, Centergy one building

Bio: Sihan Zeng is a final-year PhD student at Georgia Tech, working with Dr. Justin Romberg. His research interests lie in reinforcement learning, optimization, and applied probability. He received a B.S. in Electrical Engineering and a B.A. in Statistics from Rice University in Houston, Texas, in 2017.

Abstract: Multi-agent reinforcement learning studies sequential decision-making problems in which multiple agents coexist in the same environment and jointly determine the environment's transitions and/or rewards. In this talk we consider two specific multi-agent settings and discuss the structure of the underlying optimization problems.

The first setting is the two-player zero-sum Markov game, in which one agent maximizes the cumulative reward that the other agent seeks to minimize. Usually formulated as a nonconvex-nonconcave minimax optimization program, this problem is notoriously hard to solve with direct policy optimization algorithms. Our approach is to introduce strong structure into the Markov game through entropy regularization. We apply direct gradient descent ascent to the regularized objective and propose schemes for adjusting the regularization weight so that the algorithm converges to a global solution of the original, unregularized problem. The convergence rate of the proposed algorithm substantially improves on existing convergence bounds for gradient descent ascent algorithms.
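
For readers unfamiliar with the setup, a schematic form of the regularized objective (our notation, not necessarily that of the talk; the exact formulation and assumptions are given there) adds an entropy bonus for the maximizing player and an entropy penalty for the minimizing player:

\[
\max_{\mu}\,\min_{\nu}\; V_{\tau}(\mu,\nu)
  \;=\; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\Big( r(s_t,a_t,b_t)
  + \tau\,\mathcal{H}\big(\mu(\cdot \mid s_t)\big)
  - \tau\,\mathcal{H}\big(\nu(\cdot \mid s_t)\big) \Big)\right],
\]

where \mu and \nu are the two players' policies, \mathcal{H} denotes the Shannon entropy, \gamma is the discount factor, and \tau > 0 is the regularization weight, which is driven toward zero as the algorithm proceeds.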

In the second part of the talk, we begin by presenting an equiconnectedness property of the objective function in the single-agent policy optimization problem, both in the tabular setting and under policies parameterized by a sufficiently large neural network. As a consequence of this property, we derive a minimax theorem for a robust reinforcement learning problem in which the learning agent defends against an adversary that attacks its reward function. This is the first time such a result has been established in the literature. We conclude by pointing out a few ways to extend our work in both directions.
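
In schematic form (again in our notation; the precise conditions on the reward uncertainty set \mathcal{R} and the policy class \Pi are stated in the talk), the minimax theorem asserts that the order of maximization and minimization can be exchanged:

\[
\max_{\pi \in \Pi}\; \min_{r \in \mathcal{R}}\; V^{\pi}_{r}
  \;=\;
  \min_{r \in \mathcal{R}}\; \max_{\pi \in \Pi}\; V^{\pi}_{r},
\]

where V^{\pi}_{r} denotes the expected cumulative reward of policy \pi under reward function r.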