We consider the problem of dynamically matching supply and demand quantities of heterogeneous types. We model the problem as a Markov decision process (MDP) in which the outstanding demand is the state variable and the action is the set of supply-demand quantities to match, chosen to maximize the total discounted reward. Preliminary results show that the number of feasible actions grows exponentially with the dimension of the state vector, which greatly increases the solution time of model-based methods such as dynamic programming. This has motivated us to turn to model-free reinforcement learning (RL) methods such as Q-learning and deep RL algorithms such as DQN and DDPG. To prevent the estimation bias arising from Q-learning, we introduce a penalty term in the DDPG loss function that penalizes deterministic policies based on a prior policy. We call this new algorithm Domain Knowledge DDPG (DKDDPG). We compare the performance of all these algorithms and observe that DKDDPG achieves higher accuracy and efficiency than the other model-free approaches.
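To make the idea of a prior-policy penalty concrete, here is a minimal sketch of what a penalized actor loss could look like. It is an illustration only, not the DKDDPG loss itself: the squared-distance form of the penalty, the function name `penalized_actor_loss`, and the `penalty_weight` parameter are all assumptions, since the abstract does not specify them.

```python
def penalized_actor_loss(q_values, actions, prior_actions, penalty_weight=0.1):
    """Batch-averaged actor loss with a penalty toward a prior policy.

    Hypothetical form (an assumption, not taken from the abstract):
        loss = mean( -Q(s, a) + penalty_weight * ||a - a_prior||^2 )

    q_values      : critic estimates Q(s, a), one float per sample
    actions       : actions proposed by the deterministic policy (lists of floats)
    prior_actions : actions the prior (domain-knowledge) policy would take
    """
    if not (len(q_values) == len(actions) == len(prior_actions)):
        raise ValueError("all inputs must have the same batch size")
    if not q_values:
        raise ValueError("batch must not be empty")

    total = 0.0
    for q, action, prior in zip(q_values, actions, prior_actions):
        # Squared Euclidean distance between the policy action and the prior action.
        penalty = sum((a - p) ** 2 for a, p in zip(action, prior))
        # Standard DDPG actor term (-Q) plus the weighted penalty.
        total += -q + penalty_weight * penalty
    return total / len(q_values)


# Two samples: the first action agrees with the prior, the second deviates by 1 unit.
loss = penalized_actor_loss(
    q_values=[1.0, 2.0],
    actions=[[1.0, 0.0], [0.0, 2.0]],
    prior_actions=[[1.0, 0.0], [0.0, 1.0]],
    penalty_weight=0.5,
)
```

With `penalty_weight` set to zero the expression reduces to the usual DDPG actor objective; larger weights pull the learned policy closer to the prior policy.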
Saunak Kumar Panda is a Ph.D. student in the Industrial Engineering department at the University of Houston. He received his B.E. in Mechanical Engineering from PES University, India, and his M.S. in Mechanical Engineering from the University of Washington-Seattle. His research interests include decision making under uncertainty, reinforcement learning, and statistical machine learning. He is the marketing coordinator of the INFORMS student chapter at UH and a student member of IISE and INFORMS.