[Kaggle competition top solutions summary] Google Research Football with Manchester City F.C.

1st place：WeKick: Temporary 1st place Solution（https://www.kaggle.com/c/google-football/discussion/202232）

■Training framework
・asynchronous architecture：Mastering Complex Control in MOBA Games with Deep Reinforcement Learning（https://arxiv.org/abs/1912.09729）
・GAIL：Generative Adversarial Imitation Learning（https://proceedings.neurips.cc/paper/2016/file/cc7e2b878868cbae992d1fb743995d8f-Paper.pdf）

■Network Architecture
・PPO from openai baseline（https://github.com/openai/baselines）
・a few dense layers (256 dim each) and a LSTM block(32 steps, 256 hidden size)
・Learning rate is 1e-4
・Adam optimizer
・SMM features and the CNN block were soon abandoned due to low training speed and high memory consumption.
・multi-head value(MHV)：To decrease variance of value estimation（https://arxiv.org/abs/2011.12692）

■Feature Engineering
・relative pose(position, direction) among teammates and opponents
・relative pose between active player and ball
・offside flag to mark potential offside teammates
・sticky actions/card/tired factor status

■Reward
・+0.2 for getting possession, -0.2 for losing possession.
・+0.2 for our successful slide, -0.2 for opponent successful slide.
・+0.001 if team holds the ball(‘ball_owned_team’ = 0), -0.001 if opponent holds the ball(‘ball_owned_team’ = 1).
・+0.1 for each pass before a goal.
・We also designed some specialized rewards to increase diversity（attacks in middle way (prefer attack from the wings)）
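
The possession terms above can be sketched as a small function. This is only an illustration under assumptions: the function name, the `last_owner` bookkeeping (the last team that actually owned the ball, so that a pass in flight does not count as a possession change) and the treatment of `-1` (nobody owns the ball) are hypothetical and not taken from the WeKick write-up. The slide and specialized rewards are omitted.

```python
def possession_reward(last_owner, cur_owner):
    """Shaping terms based on 'ball_owned_team' (0 = our team, 1 = opponent).

    last_owner: last team that owned the ball (hypothetical bookkeeping).
    cur_owner:  current 'ball_owned_team'; -1 (no owner) yields no reward here,
                which is an assumption of this sketch.
    """
    reward = 0.0
    if cur_owner == 0:
        reward += 0.001          # we hold the ball this step
        if last_owner == 1:
            reward += 0.2        # we just won possession
    elif cur_owner == 1:
        reward -= 0.001          # opponent holds the ball this step
        if last_owner == 0:
            reward -= 0.2        # we just lost possession
    return reward


def pass_bonus(num_passes_before_goal):
    """+0.1 for each pass in the build-up to a goal."""
    return 0.1 * num_passes_before_goal
```

The per-step holding term is deliberately tiny compared to the possession-change term, so over a 3000-step episode it stays on the same order as a handful of turnovers.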

■training
・League training：Grandmaster level in StarCraft II（https://www.nature.com/articles/s41586-019-1724-z）
・reward shaping
・demonstrations in the first stage
・combating with each other
・counter-attack favoured、short pass favoured、holding ball favoured

3rd place：Raw Beast provisional 3rd place solution（https://www.kaggle.com/c/google-football/discussion/200709）

■Training framework
・supervised learning：around the 20th place on the leaderboard
・IMPALA2.8：asynchronous acting and learning with an off-policy correction for the actor-critic agent

■Network Architecture
behavior cloning
・actor-critic：separate actor and critic architecture without any parameter sharing
・actor updates were delayed until the critic was at least fairly decent
・actor and critic parameters are optimized with different Adam optimizers
・learning rate of the critic was typically set to two times the learning rate of the actor

・Kullback-Leibler (KL) distribution loss
・cross-entropy loss of the actions
・entropy bonus to the loss
・several shapes of convolutional networks（good balance between speed and size of the model）
・blocks of dense layers, 1-D and 2-D convolutions across players
・concatenated before the last set of dense layers

・Active player block：active player：location、direction、sticky actions.
・Opponents block：1-D convolution over 11 opponents：positions、velocities.
・Teammates block：1-D convolution over the teammates：positions、velocities.
・Pairs block：2-D convolution＋average pooling（ per teammate）：10×11 tensor：location、velocity.
・Main opponent：opponent closest to the ball：location、direction、sticky actions.
・Game data：team in possession, ball location and speed, game mode.
・Historical features：like time since last passing or shooting action
・Active opponent data：value function

・0.01-0.15 seconds per action、deeper version was quite efficient

・TD-lambda value loss (lambda 0.9)
・IMPALA policy loss.
・UPGO policy loss
・Entropy bonus
・KL loss to the initial policy（trust-region style KL loss：not change too much）
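
As a rough illustration of the TD-lambda value target listed above, the lambda-return can be computed backwards over a rollout. This is a minimal sketch, not the team's code: the function names and the discount `gamma=0.99` are assumptions, only `lam=0.9` comes from the notes, and episode terminations inside the rollout are not handled.

```python
def td_lambda_returns(rewards, values, bootstrap_value, gamma=0.99, lam=0.9):
    """Lambda-returns G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}).

    rewards[t], values[t] belong to step t; bootstrap_value is V of the state
    after the last step. gamma is an assumed value.
    """
    returns = [0.0] * len(rewards)
    next_return = bootstrap_value
    next_value = bootstrap_value
    for t in reversed(range(len(rewards))):
        returns[t] = rewards[t] + gamma * ((1 - lam) * next_value + lam * next_return)
        next_return = returns[t]
        next_value = values[t]
    return returns


def value_loss(returns, values):
    """Mean squared error between the critic's values and the lambda-returns."""
    return sum((g - v) ** 2 for g, v in zip(returns, values)) / len(values)
```

With `lam=0` this reduces to the one-step TD target, and with `lam=1` to the discounted Monte Carlo return plus the bootstrap.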

■Feature Engineering
・reduce the variance：Features related to the opponent controlled player

■Reward
・the solutions got better as the reward was simplified by making it a more straightforward mapping of the real objective
・terminal rewards of -1, 0 (end of a half) or 1 for scoring or conceding a goal

■training
・Half of the experience in our experiments comes from self-play
・self-play：twice as data efficient as playing against other agents
・self-play：detecting the local blind spots in the current policy
・self-play：The downside of only doing self-play is that it can lead to cycles

・The second half of the experience comes from playing against a pool of fixed opponents
・we got better results with a big (about 10) pool of diverse opponent strategies

・leaderboard experience of our agents

・data augmentation：double：symmetric about the x-axis（https://www.purplemath.com/modules/symmetry.htm）
・To decide what submissions are similar：marginal action frequencies、MDS embeddings of action distribution distances
・transfer learning and experienced a significant boost of the clone performance
・network was trained to be able to predict the action probabilities of 20 different submissions after concatenating the torso outputs with a one-hot encoding of the submission id
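
The x-axis mirroring used for data augmentation can be sketched as follows. This is an assumption-laden illustration: the function name, the use of action names instead of environment action indices, and the `(x, y)` tuple layout are all hypothetical. The idea is that flipping the sign of every y component and swapping "top" with "bottom" actions yields a second, equally valid training sample.

```python
# Directional actions that change under a flip about the x-axis (names are illustrative).
MIRROR_ACTION = {
    "top": "bottom",
    "bottom": "top",
    "top_left": "bottom_left",
    "bottom_left": "top_left",
    "top_right": "bottom_right",
    "bottom_right": "top_right",
}


def mirror_sample(positions, directions, action):
    """Return the sample reflected about the x-axis.

    positions / directions: lists of (x, y) tuples; only y changes sign.
    action: action name; non-directional actions (e.g. "shot") are unchanged.
    """
    flipped_positions = [(x, -y) for x, y in positions]
    flipped_directions = [(dx, -dy) for dx, dy in directions]
    return flipped_positions, flipped_directions, MIRROR_ACTION.get(action, action)
```

Applying the function twice returns the original sample, which is a quick sanity check for any real implementation of this augmentation.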

・running all the experiments in docker containers
・synchronized the work through an s3 bucket: experiment configurations, execution progress and checkpoints, agents, replays.
・each experiment was running in a single container (actor and learner in separate threads)
・next step would have been to synchronize the experience buffer across multiple machines and take the learning to a larger scale

6th place：Solutions of team “liveinparis” with codes (temp 6th)（https://www.kaggle.com/c/google-football/discussion/201376）

CODE：https://github.com/seungeunrho/football-paris

■Training framework
・

■Network Architecture
・PPO(Proximal Policy Optimization, Schulman et al.)

・ball features
・controlling(active) player features
・game status features
・1d-conv：features of 11 team players and opponent team players
・a couple of linear layers with non-linear activations

・Action Blocking：the logits of blocked actions are masked to -inf before the softmax layer
・Too far shot is blocked
・All types of pass actions are blocked unless we own the ball
・Releasing a sticky action becomes available only if the corresponding sticky action is active
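
Action blocking by masking logits to -inf can be illustrated with a plain softmax. This is a minimal sketch in pure Python rather than the team's implementation (their code is in the repository linked above); the function name and the boolean `blocked` list are assumptions, and at least one action is assumed to remain unblocked.

```python
import math


def masked_softmax(logits, blocked):
    """Softmax over logits where blocked actions get probability exactly 0.

    blocked[i] is True if action i must not be chosen (hypothetical interface).
    Assumes at least one action is not blocked.
    """
    masked = [-math.inf if b else l for l, b in zip(logits, blocked)]
    m = max(masked)                              # subtract max for numerical stability
    exps = [math.exp(l - m) for l in masked]     # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]
```

Because the mask is applied before normalization, the remaining probabilities still sum to 1 and no gradient flows to the blocked logits.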

・central learner：Actors run the simulation and send rollouts (transition tuples of horizon length 30) to the learner
・behavior policy and learning policy are equal

■Feature Engineering
・combined the 8 directional move actions into one “move action”. This reduces the 17 actions to 10 in total.
・The agent first chooses an action out of these 10 actions, and the direction is chosen only when the “move action” is selected
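
The two-stage action selection can be sketched like this. It is an illustration only: the index of the combined move action (`MOVE = 0`), the function name and the probability-list interface are assumptions, not details from the team's code.

```python
import random

MOVE = 0  # hypothetical index of the combined "move action" among the 10 actions


def sample_action(action_probs, direction_probs, rng=random):
    """Sample one of 10 actions; sample one of 8 directions only for MOVE.

    action_probs:    10 probabilities from the action head.
    direction_probs: 8 probabilities from the direction head.
    Returns (action, direction) with direction None for non-move actions.
    """
    action = rng.choices(range(len(action_probs)), weights=action_probs)[0]
    direction = None
    if action == MOVE:
        direction = rng.choices(range(len(direction_probs)), weights=direction_probs)[0]
    return action, direction
```

One consequence of this factorization is that the direction head only needs to contribute to the policy's log-probability on steps where the move action was actually chosen.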

■Reward
・scoring : +5 for our goal, -5 for opponent goal（proper scale of cumulative reward for the neural network to learn. is 20）
・ball position：divided the field into 5 zones, and the corresponding amount of reward is given every step.
・our penalty zone(-0.006)
・our zone(-0.003),
・middle(0.0)
・opponent zone(+0.003)
・opponent penalty zone(+0.006).
・yellowcard : -1 for our card, +1 for opponent card.
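
The zone-based ball position reward can be sketched as a lookup on the ball coordinates. The reward values are the ones listed above, but the zone boundaries (`PENALTY_X`, `PENALTY_Y`, `MIDDLE_X`) are hypothetical placeholders chosen for illustration; the actual boundaries are defined in the team's repository. The sketch assumes our goal is on the negative-x side.

```python
# Hypothetical zone boundaries in pitch coordinates (x roughly in [-1, 1]).
PENALTY_X = 0.64
PENALTY_Y = 0.27
MIDDLE_X = 0.2


def ball_position_reward(ball_x, ball_y):
    """Per-step reward based on which of the 5 zones contains the ball."""
    in_penalty_band = abs(ball_y) < PENALTY_Y
    if ball_x < -PENALTY_X and in_penalty_band:
        return -0.006   # our penalty zone
    if ball_x > PENALTY_X and in_penalty_band:
        return 0.006    # opponent penalty zone
    if ball_x < -MIDDLE_X:
        return -0.003   # our zone
    if ball_x > MIDDLE_X:
        return 0.003    # opponent zone
    return 0.0          # middle
```

Since this term is paid every step, even the small values accumulate: a full 3000-step episode spent in the opponent penalty zone would add up to 18, which is on the same scale as a few goals at +5 each.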

■training
・trained solely via self-play
・A randomly initialized agent (neural network) starts training from the experience of matches against itself
・The network is saved at every fixed interval (about an hour)
・these saved models compose an opponent pool
・the opponent is sampled from the pool. 50% of the opponents come from the latest 10 models
・the others come from uniform random sampling from the whole pool
・At the time of final submission, about 380 agents were in the pool
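
The opponent sampling rule described above can be sketched in a few lines. The function and parameter names are assumptions; the pool is taken to be a list of saved checkpoints ordered from oldest to newest.

```python
import random


def sample_opponent(pool, rng=random, recent=10, p_recent=0.5):
    """Pick an opponent checkpoint from the pool.

    With probability p_recent the opponent is one of the latest `recent`
    models; otherwise it is drawn uniformly from the whole pool.
    pool is assumed to be ordered oldest -> newest.
    """
    if not pool:
        raise ValueError("opponent pool is empty")
    if rng.random() < p_recent:
        return rng.choice(pool[-recent:])
    return rng.choice(pool)
```

For example, `sample_opponent(list(range(380)), random.Random(0))` draws one checkpoint index from a pool of the size reported at final submission. Mixing recent and old checkpoints keeps pressure from the current policy while still guarding against forgetting how to beat earlier strategies.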

・used 1 actor per CPU core. The final version of the agent was trained with 30 CPU cores and 1 GPU for 370 hours (CPU: AMD Ryzen Threadripper 2950X, GPU: RTX 2080)
・450,000 episodes and 133M mini-batch updates (a single mini-batch is composed of 32 rollouts, each rollout of 30 state transitions).

■Evaluation
-> used three metrics
・LB score,
・average win rate against the pool
・average win rate against rule-based AI
