Project Summary:

Project Pac-Mo is an AI agent that plays a modified version of Pac-Man by Bandai Namco. The AI of Pac-Mo is developed on top of Minecraft Malmo, and the player is named 'Robot'. The goal of the game is to earn the highest score in each stage while avoiding a monster that kills the player instantly on contact inside a closed map (17 by 17 for now). The dots, which give the score in the original game, are replaced by 'gold_ingot' items in Minecraft.

Unlike in the original game, the monster in Pac-Mo cannot be eaten by the player; it is controlled by another client, and its chasing behavior is driven by computing the shortest path from each monster to the player on every player move. A watcher client is added as well to give a full view of the map. The input to the agent is information about the visible grid cells, such as the cells vertically or horizontally reachable from the current cell that are not blocked by walls or monsters. The agent then determines the best direction to collect more 'gold_ingot' while avoiding being killed by the monsters.
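For reference, this grid information comes from Malmo's ObservationFromGrid observation. The following is only a minimal sketch of how it can be read; the grid name "floor17x17" is a placeholder, not necessarily the name used in our mission XML.

import json

def read_grid(agent_host):
    # Return the flat list of block names around the player, or None if no
    # new observation has arrived yet.
    world_state = agent_host.getWorldState()
    if world_state.number_of_observations_since_last_state > 0:
        obs = json.loads(world_state.observations[-1].text)
        return obs.get("floor17x17")   # e.g. "coal_block" for path cells
    return None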

Current progress:
Watch the video

Approach:

Pac-Mo uses a multi-agent Minecraft setup: one agent for the monster, which uses Dijkstra's algorithm to chase the player; one for the player, which performs reinforcement learning based on a Q-table to obtain the best reward, meaning maximum gold_ingot without dying; and one for the observer, which is placed above the map.

- Multi-Agent:
The multi-agent setup is implemented using MalmoPython.ClientPool. A different XML part is written for each agent, since the placement and game mode differ between agents. We use multiple agents because the monster's behavior is easier to control this way, and it also provides the watcher's perspective.
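The sketch below shows roughly how the three clients might be registered. It is a simplified illustration, not our exact code; the ports and the mission file name are placeholders.

import MalmoPython

# Hypothetical file name; the XML contains one AgentSection per role
# (player, monster, watcher), each with its own placement and game mode.
mission_xml = open("pacmo_mission.xml").read()
mission = MalmoPython.MissionSpec(mission_xml, True)

client_pool = MalmoPython.ClientPool()
for port in (10000, 10001, 10002):          # one Minecraft client per agent
    client_pool.add(MalmoPython.ClientInfo("127.0.0.1", port))

agent_hosts = [MalmoPython.AgentHost() for _ in range(3)]

# Each agent host joins the same mission under its own role index.
for role, host in enumerate(agent_hosts):
    host.startMission(mission, client_pool,
                      MalmoPython.MissionRecordSpec(), role, "PacMo")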

- Dijkstra’s Algorithm:
We are using Dijkstra's shortest-path algorithm to calculate the monster's movement. At each step, the algorithm computes the monster's next cell along the shortest path from its current cell to the player; the function monster_action then returns the turn ratio relative to the monster's current heading (turn). The monster therefore keeps chasing the player, moving at half the player's speed.
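The core idea can be sketched as follows (a simplified stand-in for the project code, with a hypothetical helper name): run Dijkstra from the monster's cell to the player's cell over the walkable cells, then take the first step of that shortest path as the monster's next move.

import heapq

def shortest_next_step(walkable, start, goal):
    # walkable: set of (x, z) path cells; start, goal: (x, z) cell coordinates.
    if start == goal:
        return start
    dist = {start: 0}
    prev = {}
    pq = [(0, start)]
    while pq:
        d, cell = heapq.heappop(pq)
        if cell == goal:
            break
        if d > dist.get(cell, float("inf")):
            continue
        x, z = cell
        for nxt in ((x + 1, z), (x - 1, z), (x, z + 1), (x, z - 1)):
            if nxt in walkable and d + 1 < dist.get(nxt, float("inf")):
                dist[nxt] = d + 1
                prev[nxt] = cell
                heapq.heappush(pq, (d + 1, nxt))
    if goal not in prev:        # player unreachable from here: stay in place
        return start
    step = goal
    while prev[step] != start:  # walk back to the first cell after start
        step = prev[step]
    return step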

The following pictures depict the movement of the monster:
[Images: the monster's movement]

- Tabular Q-learning:
We are using the tabular Q-learning method for the player's (Robot's) movement. In the current map there are 52 possible path cells (coal_block), and each cell has four possible actions: 'north', 'south', 'east', 'west'. In most cells only two directions are actually passable, as shown below. The walls' q-value is set to -9999 so that walls are excluded from the possible actions. We set epsilon = 0.01, alpha = 0.6, gamma = 1, n = 2.

The following pictures depict the reward of the wall for each situation:
[Images: the wall reward in each situation]

Notice that the reward of the wall is -9999; that value forces the player not to walk through walls.
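As a concrete illustration of the table layout (the coordinates and q-values below are made up, not taken from a real run), each path cell maps to its directions, and each direction stores the q-value together with the coordinate of the adjacent cell in that direction:

# Illustrative layout only; coordinates and values are made up.
epsilon, alpha, gamma, n = 0.01, 0.6, 1, 2

q_table = {
    (1, 1): {
        "north": [1.6, (1, 2)],     # learned positive value on a discovered path
        "south": [0.0, (1, 0)],
        "east":  [-9999, (2, 1)],   # wall in that direction: effectively excluded
    },
}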

The following code is the choose_action function in PacMo version 1.6.

def choose_action(curr_coord, grid, last):
    # last is (last_coordinate, last_direction) describing the previous move.
    possible_actions = get_possible_action(get_block_index(curr_coord), grid)
    if curr_coord not in q_table:
        q_table[curr_coord] = {}

    for direction, coord in possible_actions:
        if direction not in q_table[curr_coord]:
            q_table[curr_coord][direction] = [0, coord]

    tol = 0.0001
    rnd = random.random()

    # Keep going straight on discovered paths: repeat the last direction if its
    # q-value is non-negative at both the last cell and the current cell.
    if last[1] != "" and last[1] in q_table[curr_coord]:
        if q_table[last[0]][last[1]][0] >= 0 and q_table[curr_coord][last[1]][0] >= 0:
            for action in possible_actions:
                if action[0] == last[1]:
                    return action

    if rnd < epsilon:
        # Exploration: pick a random possible action.
        a = random.randint(0, len(possible_actions) - 1)
        return possible_actions[a]
    else:
        # Exploitation: pick randomly among the possible actions whose
        # q-value ties the maximum.
        m = max(q_table[curr_coord][direction][0] for direction, coord in possible_actions)
        best = []
        for direction, coord in possible_actions:
            value = q_table[curr_coord][direction][0]
            if abs(value - m) <= tol:
                best.append((direction, coord))
        return best[random.randint(0, len(best) - 1)]

Notice that in the middle of the code there is a statement that makes the agent keep going in the same direction as its last move from its last location, if and only if the q-value of that direction is greater than or equal to zero at both the last cell and the current cell. This mechanism forces the player to go straight along discovered paths. Otherwise, the player selects its next direction based on the maximum value in the Q-table. Since epsilon is relatively small, theoretically 99% of choose_action calls follow the procedure above. As with the monster, the turn ratio relative to the player's current heading (turn) is then returned.
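The turn-ratio computation is not shown above; a hypothetical sketch of the idea (not the exact project code) is to convert the chosen adjacent cell into a Malmo "turn" value scaled by the angle between the agent's current yaw and the target cell:

import math

def turn_ratio(curr_coord, target_coord, curr_yaw):
    # Hypothetical helper: return a value in [-1, 1] for Malmo's continuous
    # "turn" command, proportional to how far the agent must rotate.
    dx = target_coord[0] - curr_coord[0]
    dz = target_coord[1] - curr_coord[1]
    target_yaw = math.degrees(math.atan2(-dx, dz))      # Minecraft yaw: 0 = +z, 90 = -x
    delta = (target_yaw - curr_yaw + 180) % 360 - 180   # wrap into [-180, 180)
    return delta / 180.0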

The following code is part of the Q-table update function:

    # `value` is computed earlier in the function (the accumulated reward term).
    G = gamma * value
    G += gamma ** n * q_table[curr_coord][direction][0]   # bootstrap with the stored q-value
    old_q = q_table[curr_coord][direction][0]
    # Move the old estimate toward the target G by the learning rate alpha.
    q_table[curr_coord][direction][0] = old_q + alpha * (G - old_q)


Notice, from the choose_action function and the snippet above, that each q_table entry holds both the q-value (the [0]th element) and the coordinate of the adjacent cell in that direction relative to the current cell. The coordinates are used to calculate the player's turn ratio.

The rewards for actions are set as follows: wall, -9999; monster, -1; successful movement (from cell to cell), +1; gold, +1 (extra on top of the successful movement). The q-value is updated once the move to the next cell chosen by choose_action has been performed. As the q-values are updated over more and more moves, the table eventually leads the player to a solution.
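A sketch of that reward assignment (a hypothetical helper written to match the scheme just described, not the project's exact code):

def movement_reward(hit_wall, killed_by_monster, moved_to_next_cell, picked_up_gold):
    # Hypothetical helper mirroring the reward scheme described above.
    if hit_wall:
        return -9999
    if killed_by_monster:
        return -1
    reward = 1 if moved_to_next_cell else 0   # successful cell-to-cell movement
    if picked_up_gold:
        reward += 1                           # extra on top of the movement reward
    return reward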

q_table example:
[Image: q_table example]

Evaluation:

- Measurement:
The current evaluation process is based on the number of steps and the number of missions until the player (Robot) reaches a solution. Here, "solution" does not yet mean the best solution; in fact, finding the best solution is not trivial, since the game has a moving monster chasing the player. Hence, we decided to compare the number of missions needed to reach some solution for each game in the current version (1.6). The current version needs between 2 and 14 missions to reach a solution; interestingly, most solutions took 163 turns until the end of the game (the solution state).

- Comparison by version: 1.6 vs. 1.4
The following graph represents the number of missions until the player reaches some solution:
[Graph: number of missions to reach a solution, version 1.6 vs. 1.4]
Notice that version 1.4 did not reach any solution during a test of more than 200 missions, whereas version 1.6 was able to collect all twenty-seven 'gold_ingot' within 2-14 missions. Hence, the choose_action function revised in 1.6 has better path-searching performance than the one in version 1.4.

Remaining Goals and Challenges:

The main challenge is how well the Q-table will work. Beyond that, the remaining goal is to make the map and environment more complicated.

- Improving Q-learning:

We are trying to improve the set of cases in which the q-values are updated. In the current version (1.6), q-values are accumulated based on the success of each movement. For example, if the player succeeds in moving to the next cell chosen by the choose_action algorithm, the q-value of that direction from the previous cell is updated with a reward of 1. The player accumulates an extra reward of 1 for picking up a 'gold_ingot'. The q-values are decreased if and only if the player either detects a wall or is killed (contacted) by the monster.

In most examples, Q-tables are applied to static environments (e.g., lava, fixed items). In our game, however, the monster follows the player using Dijkstra's algorithm, so its movement pattern is somewhat unpredictable, and the Q-table does not work perfectly. If there were a single solution that achieved the best reward without dying, Q-learning would work very well. Our next goal is therefore to find more diverse cases for q-value updates and improve the accuracy of the Q-table, or to modify the algorithm built on the current Q-learning so that the agent learns better.

- Make more possible paths in the map:

The current version (1.6) has one square-shaped map. We are going to add more paths inside the map and test whether the player can still finish the game as the Q-table accumulates. For example, we may use a square with a cross inside it instead of a simple square.

- More monsters inside the map:

The current version has only one monster agent in the map, so the player is relatively unlikely to die. We will try adding one or two more monsters to see whether the agent still performs well. One drawback of adding monsters is the extra computing resources they require: the three existing agents already take almost 8 GB of RAM, and this memory usage problem probably cannot be solved.