MARL Example using NevarokML Reinforcement Learning Plugin in Unreal Engine 5 | Multi-Agent Learning

In this technical demonstration, we explore a sophisticated Multi-Agent Reinforcement Learning (MARL) example using the NevarokML plugin in Unreal Engine 5. The focus of this experiment is to simulate intelligent agents akin to "souls-like" NPCs, equipped with swords and shields, to enhance gameplay and create a more engaging gaming experience.
0:00 1 million steps
0:45 1 million steps x10
1:45 5 million steps
2:30 5 million steps x10
3:30 20 million steps
4:15 20 million steps x10
5:15 80 million steps
6:00 80 million steps x10
7:00 Final result
Environment Configuration:
2 Agents in a 10 x 10 arena
Maximum 400 steps per episode, with a step taken every second frame
PPO model with specific hyperparameters (see the code sketch after the training details below):
Gamma (discount factor): 0.95
Learning rate: 0.0003
Number of steps per update (nSteps): 200
Batch size: 200
Entropy coefficient (Ent Coef): 0.1
Action space: MultiDiscrete with 5 indices (see the space sketch after this configuration section):
Index 0: Movement along agent's X coordinate (-2 to +2)
Index 1: Movement along agent's Y coordinate (-2 to +2)
Index 2: Yaw turn (-2 to +2)
Index 3: Attack type (0 to 12, 0 means no attack)
Index 4: Block (0 for false, 1 for true)
Observation space: MultiDiscreteStack with 24 indices:
Agent's HP, stamina, and world yaw rotation
Agent's velocity along X and Y axes
Agent's yaw rotation angle to target agent
Distance from agent to target agent
Agent's attack type, block status, and hit status
Target agent's HP, stamina, and world yaw rotation
Target agent's velocity along X and Y axes
Target agent's attack type, block status, and hit status
Stacked memory of previous observations (indices 24 to 70)
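For reference, below is a minimal Gymnasium-style sketch of the two spaces above. The action ranges match the list exactly (each index is shifted so its lowest value maps to 0); the observation discretization and stack depth are not given in the description, so OBS_BINS and STACK are hypothetical placeholders, and the plugin's own MultiDiscreteStack type may behave differently.

```python
import numpy as np
from gymnasium import spaces

# Action space: 5 MultiDiscrete indices, matching the ranges listed above.
# Each index is offset so its lowest raw value maps to 0
# (e.g. movement -2..+2 becomes 0..4).
action_space = spaces.MultiDiscrete([
    5,   # index 0: movement along X, -2..+2
    5,   # index 1: movement along Y, -2..+2
    5,   # index 2: yaw turn, -2..+2
    13,  # index 3: attack type, 0..12 (0 = no attack)
    2,   # index 4: block, 0 = false / 1 = true
])

# Observation space: 24 base indices plus stacked memory of past observations.
# OBS_BINS (values per index) and STACK (number of stacked frames) are
# assumptions; the description does not give either number.
OBS_BINS = 16
BASE_INDICES = 24
STACK = 2
observation_space = spaces.MultiDiscrete(
    np.full(BASE_INDICES * (1 + STACK), OBS_BINS, dtype=np.int64)
)
```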
Training Details:
Training timesteps: 81,000,000
Number of agents: 200
Training time: Approximately 20 hours
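The hyperparameters above correspond to a standard PPO configuration. As a rough stand-alone equivalent (a sketch only, not the plugin's actual Blueprint/C++ API), they map onto a stable-baselines3 PPO model as shown below, where make_arena_env is a hypothetical placeholder for the Unreal-side environment:

```python
from stable_baselines3 import PPO

env = make_arena_env()  # hypothetical placeholder for the Unreal-side arena environment

model = PPO(
    "MlpPolicy",
    env,
    gamma=0.95,           # discount factor
    learning_rate=0.0003,
    n_steps=200,          # steps collected per update
    batch_size=200,
    ent_coef=0.1,         # entropy coefficient
    verbose=1,
)

# Training details above: 81,000,000 total timesteps over roughly 20 hours.
model.learn(total_timesteps=81_000_000)
```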
Rewards (see the reward-function sketch after this list):
HP decrease penalty: Negative difference in HP
Look At penalty: Absolute difference between the agent's yaw and the target's yaw
Distance reward/penalty: +1 if the distance is less than or equal to 3, -1 otherwise
Stamina reward/penalty: Positive difference in stamina
Hit reward: +damage value
Miss penalty: -damage value
Target Agent 0 HP reward: +agent HP
Agent 0 HP penalty: -target agent HP
Timeout reward/penalty: if the target agent's HP is less than or equal to 0, apply the "Target Agent 0 HP" reward; otherwise, apply the "Agent 0 HP" penalty
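Taken together, the shaping terms above can be combined into a per-step reward plus a terminal bonus/penalty. The sketch below is an illustration only; every name in it (step_reward, terminal_reward, the state arguments) is hypothetical, and the actual in-plugin weighting may differ.

```python
def step_reward(prev_hp, curr_hp, prev_stamina, curr_stamina,
                yaw_error, distance, hit, missed, damage):
    """Per-step shaping reward following the list above (hypothetical sketch)."""
    r = 0.0
    r += curr_hp - prev_hp                    # HP decrease penalty (negative when HP drops)
    r -= abs(yaw_error)                       # look-at penalty: absolute yaw difference to the target
    r += 1.0 if distance <= 3.0 else -1.0     # distance reward/penalty
    r += curr_stamina - prev_stamina          # stamina reward/penalty
    if hit:
        r += damage                           # hit reward
    if missed:
        r -= damage                           # miss penalty
    return r

def terminal_reward(agent_hp, target_hp):
    """End-of-episode term: "Target Agent 0 HP" reward vs. "Agent 0 HP" penalty."""
    return agent_hp if target_hp <= 0 else -target_hp
```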
Agent Description:
The agent resembles a simplified Dark Souls-like knight equipped with a sword and shield.
Research Targets:
The main goal is to explore "souls-like" NPC simulation using MARL to develop intelligent agents that provide more engaging and interesting gameplay experiences.
Conclusion:
After training for 81,000,000 timesteps, the agent did not achieve the expected behavior because of an incorrect reward setup and the complexity of the environment, which also made performance difficult to measure accurately.
Optimal results were obtained around 5,000,000 steps, with additional steps leading to instability.
Future improvements include better reward design and potentially training each component separately for better performance and easier trainability.