Markov chains, Markov decision processes, and deep reinforcement learning with applications to hospital management and real-time ride-hailing
This dissertation summarizes research on Markov chains, Markov decision processes, and deep reinforcement learning. Motivated by customer queues in call centers and hospital inpatient wards, we first study the stationary distributions of continuous- and discrete-time Markov chains, which often carry important operational implications. Because these stationary distributions may not admit exact closed forms, and may be time-consuming to evaluate even when they do, high-quality closed-form approximations are desirable. Utilizing Stein's method, we provide a framework to 1) construct an appropriate approximation to the stationary distribution of a Markov chain; and 2) establish a sharp upper bound on the approximation error. A generator-comparison identity typical of this framework is sketched below.

Extending from Markov chains to Markov decision processes, we develop a framework to model a common yet challenging problem facing real-time ride-hailing platforms such as Uber: in anticipation of future demand and supply conditions, how to carry out car-passenger matching and empty-car routing so as to optimize a given objective over a time horizon of interest, typically several hours to a day. To cope with the immense state and action spaces entailed by the complicated, time-varying environment of a ride-hailing service, we 1) decompose the decision at every decision epoch into a sequence of instantaneous actions, each induced by a single car or passenger (see the sketch below); and 2) develop an algorithm combining state-of-the-art deep learning and reinforcement learning techniques (deep reinforcement learning).

While recent years have witnessed a surge of deep reinforcement learning research with real-world impact in a number of applications, ride-hailing among them, our work is distinctive in two respects. First, it optimizes the long-term utility of the system rather than of individual cars: we use global demand and supply conditions as features, and these features do not concentrate on any single agent (a car candidate in our case). Indeed, at any (instantaneous) decision time point we do not prescribe which car candidate to process next (as the conventional first-in-first-out rule would); the features make that selection. Second, it learns the optimal policy directly within the reinforcement learning framework: the algorithm alternates policy evaluation and policy improvement steps over multiple policy iterations (illustrated below). A numerical study based on real-world data from Didi Chuxing demonstrates the efficacy of our methodology: it rapidly and stably improves the fraction of fulfilled ride requests to a level comparable with the state-of-the-art result and, after 73 policy iterations, surpasses it by 3%.
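To make the Stein's-method framework concrete, the display below sketches the generator-comparison identity on which steady-state approximation bounds of this kind typically rest; the notation (G_X, G_Y, f_h) is illustrative and not taken from the dissertation itself.

% A sketch of the generator-comparison (Stein) identity; notation is illustrative.
% G_X: generator of the Markov chain with stationary distribution \pi,
%      so that E_\pi[G_X f] = 0 for suitable test functions f.
% G_Y: generator of an approximating process with stationary distribution \nu.
% f_h: solution of the Stein (Poisson) equation  G_Y f_h(x) = E_\nu[h] - h(x).
\[
  E_\nu[h] - E_\pi[h] \;=\; E_\pi\big[(G_Y - G_X)\, f_h(X)\big].
\]
% Bounding |(G_Y - G_X) f_h| via derivative bounds on f_h (Stein factors) yields
% an explicit upper bound on the approximation error, uniform over a class of h.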
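As a purely illustrative rendering of the decomposition into instantaneous actions, the Python sketch below processes one car-level action at a time within a decision epoch. Every name here (feasible_actions, score, decide_epoch) is a hypothetical placeholder, and the random score stands in for the learned value function computed from global demand/supply features; none of it is the dissertation's actual implementation.

import random

def feasible_actions(car, waiting):
    # Hypothetical action set: match with any waiting request, or reposition empty.
    return [("match", p) for p in waiting] + [("reposition", None)]

def score(global_features, car, action):
    # Stand-in for the learned value function: here just a random number.
    # In the dissertation's algorithm this role is played by the estimated
    # long-run value derived from global demand and supply features.
    return random.random()

def decide_epoch(cars, waiting, global_features):
    # Decompose one decision epoch into a sequence of instantaneous actions,
    # each committing a single car. The scores, not first-in-first-out order,
    # determine which car candidate is processed next.
    actions, open_cars, waiting = [], list(cars), list(waiting)
    while open_cars:
        car, action = max(
            ((c, a) for c in open_cars for a in feasible_actions(c, waiting)),
            key=lambda pair: score(global_features, *pair),
        )
        actions.append((car, action))
        open_cars.remove(car)
        if action[0] == "match":
            waiting.remove(action[1])  # a matched request leaves the queue
    return actions

# Example: three cars, two waiting requests.
print(decide_epoch(["car_1", "car_2", "car_3"], ["req_1", "req_2"], {}))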
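The alternation of policy evaluation and policy improvement steps can be illustrated on a toy tabular MDP, as below. The dissertation's algorithm carries out these steps with deep neural networks on simulated ride-hailing trajectories; the two-state example here, with made-up transition and reward numbers, only shows the structure of the policy iteration loop.

import numpy as np

# A toy 2-state, 2-action MDP with made-up numbers (illustration only).
P = np.array([[[0.8, 0.2], [0.2, 0.8]],   # P[s, a, s']: transition probabilities
              [[0.6, 0.4], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                 # R[s, a]: expected one-step reward
              [0.5, 2.0]])
gamma = 0.9                               # discount factor

def evaluate(policy):
    # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
    P_pi = P[np.arange(2), policy]        # transition matrix under the policy
    R_pi = R[np.arange(2), policy]
    return np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

def improve(V):
    # Policy improvement: act greedily with respect to the evaluated values.
    Q = R + gamma * P @ V                 # Q[s, a]: one-step lookahead values
    return Q.argmax(axis=1)

policy = np.zeros(2, dtype=int)
for _ in range(10):                       # policy iteration: evaluate, then improve
    V = evaluate(policy)
    new_policy = improve(V)
    if np.array_equal(new_policy, policy):  # stop once the policy is stable
        break
    policy = new_policy
print("greedy policy:", policy, "state values:", V)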