Stefan’s Blog - Dealing with Partial Observability In Reinforcement Learning

Important

This blog post is still a work in progress. Currently, there seems to be an issue with attention in RLlib. I have not had time to look into it again, but I wanted to share the current state because it may still be useful.

In reinforcement learning (RL), the agent typically selects an action based on the most recent observation. In many practical environments, however, the full state can only be partially observed, so important information may be missing when only the last observation is considered. This blog post covers options for dealing with missing or only partially observed state, e.g., considering a sequence of recent observations and applying self-attention to this sequence.

Note

This blog post is based on and closely related to Anyscale’s blog post on attention nets with RLlib. In comparison, I focus less on RLlib’s trajectory API and more on providing a practical, end-to-end tutorial.

Example: The CartPole Gym Environment

As an example, consider the popular OpenAI Gym CartPole environment. Here, the task is to move a cart left or right in order to balance a pole on the cart for as long as possible.

In the normal CartPole-v1 environment, the RL agent observes four scalar values (defined here):

* The cart position, i.e., where the cart currently is.
* The cart velocity, i.e., how fast the cart is currently moving and in which direction (can be positive or negative).
* The pole angle, i.e., how tilted the pole currently is and in which direction.
* The pole angular velocity, i.e., how fast the pole is currently moving and in which direction.

All four observations are important to decide whether the cart should move left or right.

Now, assume the RL agent only has access to an instant snapshot of the cart and the pole (e.g., through a photo/raw pixels) and can observe neither the cart velocity nor the pole angular velocity. In this case, the RL agent only has partial observations and does not know whether and how fast the pole is currently swinging. As a result, standard RL agents cannot solve the problem and do not learn to balance the pole. How can we deal with this problem of partial observations, i.e., missing state (here, cart velocity and pole angular velocity)?
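To make the ambiguity concrete, here is a minimal sketch in plain Python. It is not part of the setup used in this post: the helper `to_partial_observation` and the example values are hypothetical, and only the order of the four entries follows the CartPole-v1 observation described above.

```python
from typing import List, Sequence

# Order of the entries in the CartPole-v1 observation vector
CART_POSITION, CART_VELOCITY, POLE_ANGLE, POLE_ANGULAR_VELOCITY = range(4)


def to_partial_observation(full_obs: Sequence[float]) -> List[float]:
    """Keep only what a single snapshot shows: cart position and pole angle."""
    return [full_obs[CART_POSITION], full_obs[POLE_ANGLE]]


# Two made-up full states: identical snapshot, but the cart and pole
# are moving in opposite directions.
swinging_right = [0.1, 0.5, 0.02, 1.2]
swinging_left = [0.1, -0.5, 0.02, -1.2]

# The full states differ, yet the partial observations are identical,
# so the agent cannot tell which way it should push the cart.
assert swinging_right != swinging_left
assert to_partial_observation(swinging_right) == to_partial_observation(swinging_left)
```

Both states look the same to the agent, although they likely call for opposite actions.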

Options for Dealing With Partial Observations

There are different options for dealing with partial observations/missing state, e.g., missing velocity in the CartPole example:

* Add the missing state explicitly, e.g., measure and observe velocity. Note that this may require installing extra sensors or may even be infeasible in some scenarios.
* Ignore the missing state, i.e., just rely on the available, partial observations. Depending on the missing state, this may be problematic and keep the agent from learning.
* Keep track of a sequence of the last observations. By observing the cart position and pole angle over time, the agent can implicitly derive their velocity. There are different ways to deal with this sequence:
    1. Just use the sequence as is for a standard multi-layer perceptron (MLP)/dense feedforward neural network.
    2. Feed the sequence into a recurrent neural network (RNN), e.g., with long short-term memory (LSTM).
    3. Feed the sequence into a neural network with self-attention.
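The idea behind keeping a sequence of observations can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the RLlib mechanism used later: the class `ObservationHistory` and the function `finite_difference` are hypothetical names, the observation values are made up, and `TAU = 0.02` is the step size of the Gym CartPole environment.

```python
from collections import deque
from typing import Deque, List, Sequence

TAU = 0.02  # seconds between two CartPole steps


class ObservationHistory:
    """Keeps the last `length` partial observations, oldest first."""

    def __init__(self, length: int, obs_size: int) -> None:
        self.length = length
        self.obs_size = obs_size
        self.buffer: Deque[List[float]] = deque(maxlen=length)
        self.reset()

    def reset(self) -> None:
        # Pad with zeros until enough real observations have arrived.
        self.buffer.clear()
        for _ in range(self.length):
            self.buffer.append([0.0] * self.obs_size)

    def add(self, obs: Sequence[float]) -> List[float]:
        # Appending to a full deque drops the oldest entry automatically.
        self.buffer.append(list(obs))
        # Flattened sequence, e.g., usable as input for an MLP.
        return [value for past_obs in self.buffer for value in past_obs]


def finite_difference(previous: Sequence[float], current: Sequence[float],
                      dt: float = TAU) -> List[float]:
    """Approximate the rate of change between two consecutive observations."""
    return [(c - p) / dt for p, c in zip(previous, current)]


# Usage with made-up observations [cart position, pole angle]:
history = ObservationHistory(length=2, obs_size=2)
history.add([0.10, 0.020])
stacked = history.add([0.11, 0.026])

# Roughly 0.5 (cart velocity) and 0.3 (pole angular velocity)
velocities = finite_difference(stacked[:2], stacked[2:])
```

A neural network fed with `stacked` can, in principle, learn a similar difference on its own, which is why a short history may be enough to recover the missing velocities.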

In the following, I go through each option in more detail and illustrate them using simple example code.

Setup

For the examples, I use a PPO RL agent from Ray RLlib with the CartPole environment, described above.

To install these dependencies, run the following code (tested with Python 3.8 on Windows):

#collapse-output
import os.path
!pip install ray[rllib]==1.8.0
!pip install tensorflow==2.7.0
!pip install seaborn==0.11.2
!pip install gym==0.21.0
!pip install pyglet==1.5.21

Start Ray, import RLlib’s PPO module (the default PPO config is copied per option below), and set the number of training iterations, which is the same for all options (for comparability).

import ray
from ray.rllib.agents import ppo

# adjust num_cpus and num_gpus to your system
# for some reason, num_cpus=2 gets stuck on my system (when trying to train)
ray.init(num_cpus=3, ignore_reinit_error=True)

# stop conditions based on training iterations (each with 4000 train steps)
train_iters = 10
stop = {"training_iteration": train_iters}

2021-12-01 22:52:23,565 INFO worker.py:832 -- Calling ray.init() again after it has already been called.
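As a rough sanity check on the training budget, the stop condition above can be translated into a total number of environment steps. The sketch below assumes 4000 steps per iteration, as stated in the code comment above. The alternative stop dictionary is hypothetical and untested here; it relies on Tune accepting result fields such as `timesteps_total` as stop keys.

```python
# Assumption: 4000 steps per training iteration (see the comment above).
STEPS_PER_ITERATION = 4000
train_iters = 10

# Total number of environment steps each option gets for training.
total_timesteps = train_iters * STEPS_PER_ITERATION

# Hypothetical alternative: stop on the total step count instead of iterations.
stop_by_timesteps = {"timesteps_total": total_timesteps}
```

With these numbers, each option is trained for 40000 steps in total.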

Option 1: Explicitly Add Missing State

Sometimes, it is possible to extend the observations and explicitly add important state that was previously unobserved. In the CartPole example, the cart and pole velocity can simply be “added” by using the default CartPole-v1 environment. Here, the cart velocity and pole velocity are already included in the observations.

Note that in many practical scenarios, such “missing” state cannot simply be added and observed. Instead, it may require installing additional sensors or may even be completely infeasible.

Let’s start with the best case, i.e., explicitly including the missing state.

import gym

# the default CartPole env has all 4 observations: position and velocity of both cart and pole
env = gym.make("CartPole-v1")
env.observation_space.shape

(4,)

#collapse-output

# run PPO on the default CartPole-v1 env
config1 = ppo.DEFAULT_CONFIG.copy()
config1["env"] = "CartPole-v1"

# training takes a while
results1 = ray.tune.run("PPO", config=config1, stop=stop)
print("Option 1: Training finished successfully")

== Status ==
Current time: 2021-12-01 22:52:39 (running for 00:00:00.16)
Memory usage on this node: 9.4/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\git-repos\private\blog\_notebooks\results\PPO
Number of trials: 1/1 (1 PENDING)

Trial name	status	loc
PPO_CartPole-v1_0091e_00000	PENDING

(pid=16556) 2021-12-01 22:52:50,305 INFO trainer.py:753 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
(pid=16556) 2021-12-01 22:52:50,310 INFO ppo.py:166 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(pid=16556) 2021-12-01 22:52:50,310 INFO trainer.py:770 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=2436) 2021-12-01 22:53:02,005  WARNING deprecation.py:38 -- DeprecationWarning: `SampleBatch['is_training']` has been deprecated. Use `SampleBatch.is_training` instead. This will raise an error in the future!
(pid=16556) 2021-12-01 22:53:04,522 WARNING deprecation.py:38 -- DeprecationWarning: `SampleBatch['is_training']` has been deprecated. Use `SampleBatch.is_training` instead. This will raise an error in the future!
(pid=16556) 2021-12-01 22:53:05,755 WARNING trainer_template.py:185 -- `execution_plan` functions should accept `trainer`, `workers`, and `config` as args!
(pid=16556) 2021-12-01 22:53:05,755 INFO trainable.py:110 -- Trainable.setup took 15.450 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(pid=16556) 2021-12-01 22:53:05,755 WARNING util.py:57 -- Install gputil for GPU system monitoring.
(pid=16556) 2021-12-01 22:53:12,536 WARNING deprecation.py:38 -- DeprecationWarning: `slice` has been deprecated. Use `SampleBatch[start:stop]` instead. This will raise an error in the future!
(pid=16556) Windows fatal exception: access violation
(pid=16556) 
(pid=2436) [2021-12-01 22:54:37,017 E 2436 8960] raylet_client.cc:159: IOError: Unknown error [RayletClient] Failed to disconnect from raylet.
(pid=2436) Windows fatal exception: access violation
(pid=2436) 
(pid=14216) [2021-12-01 22:54:37,018 E 14216 11712] raylet_client.cc:159: IOError: Unknown error [RayletClient] Failed to disconnect from raylet.
(pid=14216) Windows fatal exception: access violation
(pid=14216) 
2021-12-01 22:54:37,138 INFO tune.py:630 -- Total run time: 118.07 seconds (117.55 seconds for the tuning loop).

== Status ==
Current time: 2021-12-01 22:53:05 (running for 00:00:26.46)
Memory usage on this node: 10.1/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\git-repos\private\blog\_notebooks\results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_CartPole-v1_0091e_00000	RUNNING	127.0.0.1:16556

Result for PPO_CartPole-v1_0091e_00000:
  agent_timesteps_total: 4000
  custom_metrics: {}
  date: 2021-12-01_22-53-17
  done: false
  episode_len_mean: 20.331632653061224
  episode_media: {}
  episode_reward_max: 69.0
  episode_reward_mean: 20.331632653061224
  episode_reward_min: 8.0
  episodes_this_iter: 196
  episodes_total: 196
  experiment_id: 9b99b97d259948058ce175fdb437bf92
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6666923761367798
          entropy_coeff: 0.0
          kl: 0.02727562002837658
          model: {}
          policy_loss: -0.03548957407474518
          total_loss: 163.0438232421875
          vf_explained_var: 0.02411726862192154
          vf_loss: 163.0738525390625
    num_agent_steps_sampled: 4000
    num_agent_steps_trained: 4000
    num_steps_sampled: 4000
    num_steps_trained: 4000
  iterations_since_restore: 1
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 85.05625
    ram_util_percent: 85.14375
  pid: 16556
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.1063987523513995
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.12976663512813205
    mean_inference_ms: 2.738906715446057
    mean_raw_obs_processing_ms: 0.29470778093728145
  time_since_restore: 11.613266468048096
  time_this_iter_s: 11.613266468048096
  time_total_s: 11.613266468048096
  timers:
    learn_throughput: 829.341
    learn_time_ms: 4823.104
    load_throughput: 0.0
    load_time_ms: 0.0
    sample_throughput: 590.463
    sample_time_ms: 6774.35
    update_time_ms: 2.998
  timestamp: 1638395597
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 4000
  training_iteration: 1
  trial_id: 0091e_00000
  
Result for PPO_CartPole-v1_0091e_00000:
  agent_timesteps_total: 8000
  custom_metrics: {}
  date: 2021-12-01_22-53-27
  done: false
  episode_len_mean: 43.5
  episode_media: {}
  episode_reward_max: 128.0
  episode_reward_mean: 43.5
  episode_reward_min: 9.0
  episodes_this_iter: 85
  episodes_total: 281
  experiment_id: 9b99b97d259948058ce175fdb437bf92
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 4.999999873689376e-05
          entropy: 0.6100984811782837
          entropy_coeff: 0.0
          kl: 0.018913770094513893
          model: {}
          policy_loss: -0.03986572101712227
          total_loss: 392.36260986328125
          vf_explained_var: 0.05626700446009636
          vf_loss: 392.3967590332031
    num_agent_steps_sampled: 8000
    num_agent_steps_trained: 8000
    num_steps_sampled: 8000
    num_steps_trained: 8000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 2
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 85.2
    ram_util_percent: 85.08571428571429
  pid: 16556
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10915718929193942
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.12762452156562074
    mean_inference_ms: 2.478078257762592
    mean_raw_obs_processing_ms: 0.24854778323044394
  time_since_restore: 21.935389518737793
  time_this_iter_s: 10.322123050689697
  time_total_s: 21.935389518737793
  timers:
    learn_throughput: 806.601
    learn_time_ms: 4959.079
    load_throughput: 8015870.043
    load_time_ms: 0.499
    sample_throughput: 474.211
    sample_time_ms: 8435.067
    update_time_ms: 3.001
  timestamp: 1638395607
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 8000
  training_iteration: 2
  trial_id: 0091e_00000
  
Result for PPO_CartPole-v1_0091e_00000:
  agent_timesteps_total: 12000
  custom_metrics: {}
  date: 2021-12-01_22-53-37
  done: false
  episode_len_mean: 70.15
  episode_media: {}
  episode_reward_max: 292.0
  episode_reward_mean: 70.15
  episode_reward_min: 11.0
  episodes_this_iter: 36
  episodes_total: 317
  experiment_id: 9b99b97d259948058ce175fdb437bf92
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 4.999999873689376e-05
          entropy: 0.5675911903381348
          entropy_coeff: 0.0
          kl: 0.009604203514754772
          model: {}
          policy_loss: -0.02363675646483898
          total_loss: 785.93994140625
          vf_explained_var: 0.09917476773262024
          vf_loss: 785.960693359375
    num_agent_steps_sampled: 12000
    num_agent_steps_trained: 12000
    num_steps_sampled: 12000
    num_steps_trained: 12000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 3
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 83.96428571428571
    ram_util_percent: 85.07857142857142
  pid: 16556
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.11086732721675017
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.129464220418653
    mean_inference_ms: 2.403483071642798
    mean_raw_obs_processing_ms: 0.23551804810537205
  time_since_restore: 31.98438596725464
  time_this_iter_s: 10.048996448516846
  time_total_s: 31.98438596725464
  timers:
    learn_throughput: 827.692
    learn_time_ms: 4832.716
    load_throughput: 6088260.312
    load_time_ms: 0.657
    sample_throughput: 436.537
    sample_time_ms: 9163.022
    update_time_ms: 2.669
  timestamp: 1638395617
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 12000
  training_iteration: 3
  trial_id: 0091e_00000
  
Result for PPO_CartPole-v1_0091e_00000:
  agent_timesteps_total: 16000
  custom_metrics: {}
  date: 2021-12-01_22-53-46
  done: false
  episode_len_mean: 97.99
  episode_media: {}
  episode_reward_max: 371.0
  episode_reward_mean: 97.99
  episode_reward_min: 11.0
  episodes_this_iter: 20
  episodes_total: 337
  experiment_id: 9b99b97d259948058ce175fdb437bf92
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 4.999999873689376e-05
          entropy: 0.5582262873649597
          entropy_coeff: 0.0
          kl: 0.0037608244456350803
          model: {}
          policy_loss: -0.012490973807871342
          total_loss: 696.2131958007812
          vf_explained_var: 0.2233099341392517
          vf_loss: 696.2244873046875
    num_agent_steps_sampled: 16000
    num_agent_steps_trained: 16000
    num_steps_sampled: 16000
    num_steps_trained: 16000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 4
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 83.60769230769232
    ram_util_percent: 85.76923076923075
  pid: 16556
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.1126154093070354
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.1320370369396816
    mean_inference_ms: 2.352884663632094
    mean_raw_obs_processing_ms: 0.2294510967493389
  time_since_restore: 40.960214138031006
  time_this_iter_s: 8.975828170776367
  time_total_s: 40.960214138031006
  timers:
    learn_throughput: 839.249
    learn_time_ms: 4766.169
    load_throughput: 8117680.416
    load_time_ms: 0.493
    sample_throughput: 438.213
    sample_time_ms: 9127.987
    update_time_ms: 2.002
  timestamp: 1638395626
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 16000
  training_iteration: 4
  trial_id: 0091e_00000
  
Result for PPO_CartPole-v1_0091e_00000:
  agent_timesteps_total: 20000
  custom_metrics: {}
  date: 2021-12-01_22-53-56
  done: false
  episode_len_mean: 132.51
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 132.51
  episode_reward_min: 12.0
  episodes_this_iter: 15
  episodes_total: 352
  experiment_id: 9b99b97d259948058ce175fdb437bf92
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.15000000596046448
          cur_lr: 4.999999873689376e-05
          entropy: 0.5601204037666321
          entropy_coeff: 0.0
          kl: 0.0012711239978671074
          model: {}
          policy_loss: -0.007236114237457514
          total_loss: 605.6217041015625
          vf_explained_var: 0.29318979382514954
          vf_loss: 605.6287841796875
    num_agent_steps_sampled: 20000
    num_agent_steps_trained: 20000
    num_steps_sampled: 20000
    num_steps_trained: 20000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 5
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 83.5
    ram_util_percent: 86.91538461538461
  pid: 16556
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.11232943522683439
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.1322710531601683
    mean_inference_ms: 2.318162079591458
    mean_raw_obs_processing_ms: 0.22431872747963477
  time_since_restore: 50.56009912490845
  time_this_iter_s: 9.599884986877441
  time_total_s: 50.56009912490845
  timers:
    learn_throughput: 851.742
    learn_time_ms: 4696.257
    load_throughput: 10147100.52
    load_time_ms: 0.394
    sample_throughput: 431.646
    sample_time_ms: 9266.849
    update_time_ms: 1.601
  timestamp: 1638395636
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 20000
  training_iteration: 5
  trial_id: 0091e_00000
  
Result for PPO_CartPole-v1_0091e_00000:
  agent_timesteps_total: 24000
  custom_metrics: {}
  date: 2021-12-01_22-54-04
  done: false
  episode_len_mean: 162.46
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 162.46
  episode_reward_min: 13.0
  episodes_this_iter: 16
  episodes_total: 368
  experiment_id: 9b99b97d259948058ce175fdb437bf92
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.07500000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.5491490960121155
          entropy_coeff: 0.0
          kl: 0.012883742339909077
          model: {}
          policy_loss: -0.014221735298633575
          total_loss: 350.70465087890625
          vf_explained_var: 0.5025997757911682
          vf_loss: 350.7178955078125
    num_agent_steps_sampled: 24000
    num_agent_steps_trained: 24000
    num_steps_sampled: 24000
    num_steps_trained: 24000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 6
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 69.30000000000001
    ram_util_percent: 87.18181818181819
  pid: 16556
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.11068721601459906
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.1316580125011057
    mean_inference_ms: 2.2647407230497265
    mean_raw_obs_processing_ms: 0.21722746948586308
  time_since_restore: 58.025999307632446
  time_this_iter_s: 7.465900182723999
  time_total_s: 58.025999307632446
  timers:
    learn_throughput: 884.584
    learn_time_ms: 4521.899
    load_throughput: 12176520.624
    load_time_ms: 0.329
    sample_throughput: 439.637
    sample_time_ms: 9098.425
    update_time_ms: 1.335
  timestamp: 1638395644
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 24000
  training_iteration: 6
  trial_id: 0091e_00000
  
Result for PPO_CartPole-v1_0091e_00000:
  agent_timesteps_total: 28000
  custom_metrics: {}
  date: 2021-12-01_22-54-12
  done: false
  episode_len_mean: 196.68
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 196.68
  episode_reward_min: 15.0
  episodes_this_iter: 8
  episodes_total: 376
  experiment_id: 9b99b97d259948058ce175fdb437bf92
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.07500000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.5566104650497437
          entropy_coeff: 0.0
          kl: 0.0053793760016560555
          model: {}
          policy_loss: -0.009221607819199562
          total_loss: 434.2621765136719
          vf_explained_var: 0.1736932396888733
          vf_loss: 434.2709655761719
    num_agent_steps_sampled: 28000
    num_agent_steps_trained: 28000
    num_steps_sampled: 28000
    num_steps_trained: 28000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 7
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 71.21818181818182
    ram_util_percent: 87.13636363636364
  pid: 16556
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.11012958319487443
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.13087477368874187
    mean_inference_ms: 2.2335852221532058
    mean_raw_obs_processing_ms: 0.2131696843540125
  time_since_restore: 66.21467590332031
  time_this_iter_s: 8.188676595687866
  time_total_s: 66.21467590332031
  timers:
    learn_throughput: 900.369
    learn_time_ms: 4442.624
    load_throughput: 14205940.728
    load_time_ms: 0.282
    sample_throughput: 447.857
    sample_time_ms: 8931.421
    update_time_ms: 1.144
  timestamp: 1638395652
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 28000
  training_iteration: 7
  trial_id: 0091e_00000
  
Result for PPO_CartPole-v1_0091e_00000:
  agent_timesteps_total: 32000
  custom_metrics: {}
  date: 2021-12-01_22-54-21
  done: false
  episode_len_mean: 229.19
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 229.19
  episode_reward_min: 15.0
  episodes_this_iter: 9
  episodes_total: 385
  experiment_id: 9b99b97d259948058ce175fdb437bf92
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.07500000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.5450154542922974
          entropy_coeff: 0.0
          kl: 0.0061668953858315945
          model: {}
          policy_loss: -0.006067643407732248
          total_loss: 457.9305114746094
          vf_explained_var: 0.032116785645484924
          vf_loss: 457.9361267089844
    num_agent_steps_sampled: 32000
    num_agent_steps_trained: 32000
    num_steps_sampled: 32000
    num_steps_trained: 32000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 8
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 76.9
    ram_util_percent: 87.25833333333333
  pid: 16556
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10949012056065484
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.12963875953335788
    mean_inference_ms: 2.2011992223423453
    mean_raw_obs_processing_ms: 0.20929612392916075
  time_since_restore: 74.98133444786072
  time_this_iter_s: 8.766658544540405
  time_total_s: 74.98133444786072
  timers:
    learn_throughput: 920.017
    learn_time_ms: 4347.744
    load_throughput: 16235360.832
    load_time_ms: 0.246
    sample_throughput: 446.868
    sample_time_ms: 8951.185
    update_time_ms: 1.001
  timestamp: 1638395661
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 32000
  training_iteration: 8
  trial_id: 0091e_00000
  
Result for PPO_CartPole-v1_0091e_00000:
  agent_timesteps_total: 36000
  custom_metrics: {}
  date: 2021-12-01_22-54-28
  done: false
  episode_len_mean: 260.25
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 260.25
  episode_reward_min: 15.0
  episodes_this_iter: 8
  episodes_total: 393
  experiment_id: 9b99b97d259948058ce175fdb437bf92
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.07500000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.5320675373077393
          entropy_coeff: 0.0
          kl: 0.0075341472402215
          model: {}
          policy_loss: -0.0072624157182872295
          total_loss: 404.454345703125
          vf_explained_var: 0.05579644814133644
          vf_loss: 404.4610290527344
    num_agent_steps_sampled: 36000
    num_agent_steps_trained: 36000
    num_steps_sampled: 36000
    num_steps_trained: 36000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 9
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 68.26999999999998
    ram_util_percent: 87.2
  pid: 16556
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10868070777730038
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.1282069423856235
    mean_inference_ms: 2.1709855087013663
    mean_raw_obs_processing_ms: 0.2055371812270087
  time_since_restore: 82.31100749969482
  time_this_iter_s: 7.3296730518341064
  time_total_s: 82.31100749969482
  timers:
    learn_throughput: 939.355
    learn_time_ms: 4258.24
    load_throughput: 18264780.936
    load_time_ms: 0.219
    sample_throughput: 455.121
    sample_time_ms: 8788.864
    update_time_ms: 0.89
  timestamp: 1638395668
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 36000
  training_iteration: 9
  trial_id: 0091e_00000
  
Result for PPO_CartPole-v1_0091e_00000:
  agent_timesteps_total: 40000
  custom_metrics: {}
  date: 2021-12-01_22-54-36
  done: true
  episode_len_mean: 292.74
  episode_media: {}
  episode_reward_max: 500.0
  episode_reward_mean: 292.74
  episode_reward_min: 15.0
  episodes_this_iter: 9
  episodes_total: 402
  experiment_id: 9b99b97d259948058ce175fdb437bf92
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.07500000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.5234162211418152
          entropy_coeff: 0.0
          kl: 0.004971951246261597
          model: {}
          policy_loss: -0.0019533345475792885
          total_loss: 415.5965576171875
          vf_explained_var: 0.15562385320663452
          vf_loss: 415.59814453125
    num_agent_steps_sampled: 40000
    num_agent_steps_trained: 40000
    num_steps_sampled: 40000
    num_steps_trained: 40000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 10
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 79.35000000000001
    ram_util_percent: 87.0
  pid: 16556
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10759807711359962
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.12680782099974997
    mean_inference_ms: 2.1360139834425205
    mean_raw_obs_processing_ms: 0.20154871655205042
  time_since_restore: 90.66206645965576
  time_this_iter_s: 8.351058959960938
  time_total_s: 90.66206645965576
  timers:
    learn_throughput: 951.379
    learn_time_ms: 4204.422
    load_throughput: 6695620.386
    load_time_ms: 0.597
    sample_throughput: 458.038
    sample_time_ms: 8732.908
    update_time_ms: 1.201
  timestamp: 1638395676
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 40000
  training_iteration: 10
  trial_id: 0091e_00000
  
Option 1: Training finished successfully
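To summarize the training results above without scrolling through them, the following small sketch checks how the mean episode reward developed. The values are the `episode_reward_mean` entries from the ten results above, rounded to two decimals; the variable names are only illustrative.

```python
# episode_reward_mean after each of the 10 iterations (rounded, from the results above)
reward_means = [20.33, 43.5, 70.15, 97.99, 132.51, 162.46, 196.68, 229.19, 260.25, 292.74]

MAX_REWARD = 500.0  # highest episode_reward_max reported in the results above

# Change in mean reward from one iteration to the next
gains = [later - earlier for earlier, later in zip(reward_means, reward_means[1:])]
steadily_improving = all(gain > 0 for gain in gains)

# How close the final mean reward is to the maximum
share_of_max = reward_means[-1] / MAX_REWARD

print(f"improving every iteration: {steadily_improving}")
print(f"final mean reward is {share_of_max:.0%} of the maximum")
```

With full observations, the mean reward increases in every iteration but has not yet reached the maximum after 10 iterations.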

== Status ==
Current time: 2021-12-01 22:53:22 (running for 00:00:43.31)
Memory usage on this node: 10.1/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\git-repos\private\blog\_notebooks\results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_CartPole-v1_0091e_00000	RUNNING	127.0.0.1:16556	1	11.6133	4000	20.3316	69	8	20.3316

== Status ==
Current time: 2021-12-01 22:53:32 (running for 00:00:53.64)
Memory usage on this node: 10.1/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\git-repos\private\blog\_notebooks\results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_CartPole-v1_0091e_00000	RUNNING	127.0.0.1:16556	2	21.9354	8000	43.5	128	9	43.5

== Status ==
Current time: 2021-12-01 22:53:38 (running for 00:00:59.59)
Memory usage on this node: 10.1/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\git-repos\private\blog\_notebooks\results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_CartPole-v1_0091e_00000	RUNNING	127.0.0.1:16556	3	31.9844	12000	70.15	292	11	70.15

== Status ==
Current time: 2021-12-01 22:53:49 (running for 00:01:10.64)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\git-repos\private\blog\_notebooks\results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_CartPole-v1_0091e_00000	RUNNING	127.0.0.1:16556	4	40.9602	16000	97.99	371	11	97.99

== Status ==
Current time: 2021-12-01 22:53:55 (running for 00:01:15.74)
Memory usage on this node: 10.4/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\git-repos\private\blog\_notebooks\results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_CartPole-v1_0091e_00000	RUNNING	127.0.0.1:16556	4	40.9602	16000	97.99	371	11	97.99

== Status ==
Current time: 2021-12-01 22:54:00 (running for 00:01:21.29)
Memory usage on this node: 10.4/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\git-repos\private\blog\_notebooks\results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_CartPole-v1_0091e_00000	RUNNING	127.0.0.1:16556	5	50.5601	20000	132.51	500	12	132.51

== Status ==
Current time: 2021-12-01 22:54:06 (running for 00:01:26.77)
Memory usage on this node: 10.4/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\git-repos\private\blog\_notebooks\results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_CartPole-v1_0091e_00000	RUNNING	127.0.0.1:16556	6	58.026	24000	162.46	500	13	162.46

== Status ==
Current time: 2021-12-01 22:54:11 (running for 00:01:31.87)
Memory usage on this node: 10.4/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\git-repos\private\blog\_notebooks\results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_CartPole-v1_0091e_00000	RUNNING	127.0.0.1:16556	6	58.026	24000	162.46	500	13	162.46

== Status ==
Current time: 2021-12-01 22:54:16 (running for 00:01:37.04)
Memory usage on this node: 10.4/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\git-repos\private\blog\_notebooks\results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_CartPole-v1_0091e_00000	RUNNING	127.0.0.1:16556	7	66.2147	28000	196.68	500	15	196.68

== Status ==
Current time: 2021-12-01 22:54:23 (running for 00:01:43.79)
Memory usage on this node: 10.4/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\git-repos\private\blog\_notebooks\results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_CartPole-v1_0091e_00000	RUNNING	127.0.0.1:16556	8	74.9813	32000	229.19	500	15	229.19

== Status ==
Current time: 2021-12-01 22:54:28 (running for 00:01:48.84)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\git-repos\private\blog\_notebooks\results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_CartPole-v1_0091e_00000	RUNNING	127.0.0.1:16556	8	74.9813	32000	229.19	500	15	229.19

== Status ==
Current time: 2021-12-01 22:54:34 (running for 00:01:55.21)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\git-repos\private\blog\_notebooks\results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_CartPole-v1_0091e_00000	RUNNING	127.0.0.1:16556	9	82.311	36000	260.25	500	15	260.25

== Status ==
Current time: 2021-12-01 22:54:36 (running for 00:01:57.59)
Memory usage on this node: 10.4/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\git-repos\private\blog\_notebooks\results\PPO
Number of trials: 1/1 (1 TERMINATED)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_CartPole-v1_0091e_00000	TERMINATED	127.0.0.1:16556	10	90.6621	40000	292.74	500	15	292.74

# check and print results
def print_reward(results):
    results.default_metric = "episode_reward_mean"
    results.default_mode = "max"
    # print mean number of time steps the pole was balanced (higher = better)
    reward = results.best_result["episode_reward_mean"]
    print(f"Reward after {train_iters} training iterations: {reward}")

print_reward(results1)

Reward after 10 training iterations: 292.74

# plot the last 100 episode rewards
import seaborn as sns

def plot_rewards(results):
    """Plot scatter plot of the last 100 training episodes"""
    eps_rewards = results.best_result["hist_stats"]["episode_reward"]
    eps = list(range(len(eps_rewards)))
    # pass x and y as keyword arguments; seaborn deprecates passing them positionally
    ax = sns.scatterplot(x=eps, y=eps_rewards)
    ax.set_title("Reward over the last 100 Episodes")
    ax.set_xlabel("Episodes")
    ax.set_ylabel("Episode Reward")


plot_rewards(results1)

import os
import pandas as pd

# plot complete learning curve based on logged progress
def plot_learning(results, label=None):
    """Plot lineplot of the mean episode reward over all training iterations"""
    progress_path = os.path.join(results.best_logdir, "progress.csv")
    df = pd.read_csv(progress_path)
    ax = sns.lineplot(x=df["training_iteration"], y=df["episode_reward_mean"], label=label)
    ax.set_title("Mean Episode Reward over Training Iterations")

plot_learning(results1, label="1: Full Observations")

Including the missing state helps the agent learn a good policy quickly: with full observations, the mean episode reward reaches 292.74 after only 10 training iterations.
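The learning speed can also be read directly from the logged numbers. The following is a minimal sketch and not part of the original notebook: `first_iteration_reaching` and the threshold of 195 are hypothetical choices for illustration, and the sample rows simply repeat the mean rewards from the training output above, using the two `progress.csv` columns that `plot_learning` reads.

```python
import csv
import io

# Mean episode rewards per training iteration, copied from the training output above.
# Only the two columns that plot_learning reads from progress.csv are reproduced here.
SAMPLE_PROGRESS = """training_iteration,episode_reward_mean
1,20.3316
2,43.5
3,70.15
4,97.99
5,132.51
6,162.46
7,196.68
8,229.19
9,260.25
10,292.74
"""


def first_iteration_reaching(progress_file, threshold):
    """Return the first training iteration whose mean reward is >= threshold, else None."""
    for row in csv.DictReader(progress_file):
        if float(row["episode_reward_mean"]) >= threshold:
            return int(row["training_iteration"])
    return None


# 195 is an arbitrary, hypothetical threshold chosen only for illustration
print(first_iteration_reaching(io.StringIO(SAMPLE_PROGRESS), 195))  # prints 7
```

For a real run, the same function could presumably be applied to the opened file at `os.path.join(results.best_logdir, "progress.csv")`.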

Option 2: Ignore Missing State

In many practical scenarios, the missing state cannot simply be added to complete the partial observations, e.g., because measuring or capturing the missing observations is prohibitively expensive or physically infeasible.

In this case, the simplest alternative is to use the partial observations as they are. This works if the observations still contain enough information to learn a useful policy.

However, if too much important information is missing, learning a useful policy becomes slow or even impossible. In the CartPole example, partial observations that include neither the cart velocity nor the pole angular velocity keep the agent from learning a useful policy.
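Why the velocities matter can be illustrated with a small sketch. It is not RLlib's implementation of the partially observable environment; `to_partial_observation` and the two example states are hypothetical and only assume the observation order listed earlier (cart position, cart velocity, pole angle, pole angular velocity).

```python
from typing import Sequence, Tuple

# Index layout of the full CartPole observation, in the order listed earlier in this post
CART_POSITION, CART_VELOCITY, POLE_ANGLE, POLE_ANGULAR_VELOCITY = range(4)


def to_partial_observation(full_obs: Sequence[float]) -> Tuple[float, float]:
    """Keep only cart position and pole angle; drop both velocities."""
    return (full_obs[CART_POSITION], full_obs[POLE_ANGLE])


# Two hypothetical states with the same snapshot (position and angle),
# but with the cart and pole moving in opposite directions
swinging_right = (0.0, 0.1, 0.05, 0.8)
swinging_left = (0.0, -0.1, 0.05, -0.8)

# Both states collapse to the same partial observation, so a policy that only
# sees the last partial observation must choose the same action in both
print(to_partial_observation(swinging_right) == to_partial_observation(swinging_left))  # prints True
```

Since the two states call for different actions but look identical to the agent, a single partial observation is not enough to decide how to move the cart.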

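from ray.rllib.examples.env.stateless_cartpole import StatelessCartPole
from ray.tune import registry

# StatelessCartPole is RLlib's CartPole variant that hides the cart velocity and
# pole angular velocity, so the agent only observes cart position and pole angle
registry.register_env("StatelessCartPole", lambda _: StatelessCartPole())
# start from RLlib's default PPO config and swap in the partially observable environment
config2 = ppo.DEFAULT_CONFIG.copy()
config2["env"] = "StatelessCartPole"
# train; this takes a while
results2 = ray.tune.run("PPO", config=config2, stop=stop)
print("Option 2: Training finished successfully")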

== Status ==
Current time: 2021-12-01 22:57:23 (running for 00:00:00.16)
Memory usage on this node: 9.7/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 PENDING)

Trial name	status	loc
PPO_StatelessCartPole_aa22d_00000	PENDING

2021-12-01 22:59:20,276 INFO tune.py:630 -- Total run time: 116.62 seconds (116.23 seconds for the tuning loop).

== Status ==
Current time: 2021-12-01 22:57:28 (running for 00:00:05.16)
Memory usage on this node: 9.8/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 PENDING)

Trial name	status	loc
PPO_StatelessCartPole_aa22d_00000	PENDING

== Status ==
Current time: 2021-12-01 22:57:56 (running for 00:00:32.50)
Memory usage on this node: 10.4/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StatelessCartPole_aa22d_00000	RUNNING	127.0.0.1:9044

== Status ==
Current time: 2021-12-01 22:57:57 (running for 00:00:33.51)
Memory usage on this node: 10.4/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StatelessCartPole_aa22d_00000	RUNNING	127.0.0.1:9044

== Status ==
Current time: 2021-12-01 22:58:02 (running for 00:00:38.63)
Memory usage on this node: 10.5/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StatelessCartPole_aa22d_00000	RUNNING	127.0.0.1:9044

Result for PPO_StatelessCartPole_aa22d_00000:
  agent_timesteps_total: 4000
  custom_metrics: {}
  date: 2021-12-01_22-58-06
  done: false
  episode_len_mean: 22.44632768361582
  episode_media: {}
  episode_reward_max: 85.0
  episode_reward_mean: 22.44632768361582
  episode_reward_min: 8.0
  episodes_this_iter: 177
  episodes_total: 177
  experiment_id: 99df0008334f43779394474d46d27ce1
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6807681322097778
          entropy_coeff: 0.0
          kl: 0.012478094547986984
          model: {}
          policy_loss: -0.02269022725522518
          total_loss: 180.2766876220703
          vf_explained_var: 0.0005618375726044178
          vf_loss: 180.296875
    num_agent_steps_sampled: 4000
    num_agent_steps_trained: 4000
    num_steps_sampled: 4000
    num_steps_trained: 4000
  iterations_since_restore: 1
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 83.74285714285713
    ram_util_percent: 88.00714285714285
  pid: 9044
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10327919820561605
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.12135616748037466
    mean_inference_ms: 1.9038462404123306
    mean_raw_obs_processing_ms: 0.16696460223270435
  time_since_restore: 10.051005840301514
  time_this_iter_s: 10.051005840301514
  time_total_s: 10.051005840301514
  timers:
    learn_throughput: 757.885
    learn_time_ms: 5277.849
    load_throughput: 0.0
    load_time_ms: 0.0
    sample_throughput: 837.645
    sample_time_ms: 4775.29
    update_time_ms: 5.515
  timestamp: 1638395886
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 4000
  training_iteration: 1
  trial_id: aa22d_00000
  
Result for PPO_StatelessCartPole_aa22d_00000:
  agent_timesteps_total: 8000
  custom_metrics: {}
  date: 2021-12-01_22-58-13
  done: false
  episode_len_mean: 30.083333333333332
  episode_media: {}
  episode_reward_max: 106.0
  episode_reward_mean: 30.083333333333332
  episode_reward_min: 8.0
  episodes_this_iter: 132
  episodes_total: 309
  experiment_id: 99df0008334f43779394474d46d27ce1
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.648204505443573
          entropy_coeff: 0.0
          kl: 0.00953536108136177
          model: {}
          policy_loss: -0.010645464062690735
          total_loss: 191.36209106445312
          vf_explained_var: 0.02945260889828205
          vf_loss: 191.37083435058594
    num_agent_steps_sampled: 8000
    num_agent_steps_trained: 8000
    num_steps_sampled: 8000
    num_steps_trained: 8000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 2
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 73.69090909090909
    ram_util_percent: 88.29090909090907
  pid: 9044
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.09029685607221599
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.1000974098193085
    mean_inference_ms: 1.6932173327936377
    mean_raw_obs_processing_ms: 0.18978260286703064
  time_since_restore: 17.443942308425903
  time_this_iter_s: 7.39293646812439
  time_total_s: 17.443942308425903
  timers:
    learn_throughput: 899.218
    learn_time_ms: 4448.31
    load_throughput: 0.0
    load_time_ms: 0.0
    sample_throughput: 577.076
    sample_time_ms: 6931.495
    update_time_ms: 5.261
  timestamp: 1638395893
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 8000
  training_iteration: 2
  trial_id: aa22d_00000
  
Result for PPO_StatelessCartPole_aa22d_00000:
  agent_timesteps_total: 12000
  custom_metrics: {}
  date: 2021-12-01_22-58-24
  done: false
  episode_len_mean: 37.31481481481482
  episode_media: {}
  episode_reward_max: 143.0
  episode_reward_mean: 37.31481481481482
  episode_reward_min: 9.0
  episodes_this_iter: 108
  episodes_total: 417
  experiment_id: 99df0008334f43779394474d46d27ce1
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6112045049667358
          entropy_coeff: 0.0
          kl: 0.006910913623869419
          model: {}
          policy_loss: -0.015092005021870136
          total_loss: 245.5015411376953
          vf_explained_var: 0.021608643233776093
          vf_loss: 245.5152587890625
    num_agent_steps_sampled: 12000
    num_agent_steps_trained: 12000
    num_steps_sampled: 12000
    num_steps_trained: 12000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 3
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 92.86
    ram_util_percent: 88.22666666666667
  pid: 9044
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10518439466006081
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.11606732866669264
    mean_inference_ms: 1.8527932982897477
    mean_raw_obs_processing_ms: 0.19247182033120946
  time_since_restore: 28.011292934417725
  time_this_iter_s: 10.567350625991821
  time_total_s: 28.011292934417725
  timers:
    learn_throughput: 856.291
    learn_time_ms: 4671.309
    load_throughput: 11972323.501
    load_time_ms: 0.334
    sample_throughput: 522.229
    sample_time_ms: 7659.481
    update_time_ms: 4.504
  timestamp: 1638395904
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 12000
  training_iteration: 3
  trial_id: aa22d_00000
  
Result for PPO_StatelessCartPole_aa22d_00000:
  agent_timesteps_total: 16000
  custom_metrics: {}
  date: 2021-12-01_22-58-32
  done: false
  episode_len_mean: 42.79
  episode_media: {}
  episode_reward_max: 152.0
  episode_reward_mean: 42.79
  episode_reward_min: 10.0
  episodes_this_iter: 94
  episodes_total: 511
  experiment_id: 99df0008334f43779394474d46d27ce1
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.5654925107955933
          entropy_coeff: 0.0
          kl: 0.004174998961389065
          model: {}
          policy_loss: -0.012154466472566128
          total_loss: 252.0902862548828
          vf_explained_var: 0.03405797854065895
          vf_loss: 252.1016082763672
    num_agent_steps_sampled: 16000
    num_agent_steps_trained: 16000
    num_steps_sampled: 16000
    num_steps_trained: 16000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 4
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 79.5
    ram_util_percent: 86.57272727272728
  pid: 9044
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10566344965230401
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.12007891816714177
    mean_inference_ms: 1.8777825593616126
    mean_raw_obs_processing_ms: 0.19165146846813094
  time_since_restore: 36.43282437324524
  time_this_iter_s: 8.421531438827515
  time_total_s: 36.43282437324524
  timers:
    learn_throughput: 915.336
    learn_time_ms: 4369.982
    load_throughput: 15963098.002
    load_time_ms: 0.251
    sample_throughput: 483.585
    sample_time_ms: 8271.563
    update_time_ms: 3.884
  timestamp: 1638395912
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 16000
  training_iteration: 4
  trial_id: aa22d_00000
  
Result for PPO_StatelessCartPole_aa22d_00000:
  agent_timesteps_total: 20000
  custom_metrics: {}
  date: 2021-12-01_22-58-39
  done: false
  episode_len_mean: 43.32
  episode_media: {}
  episode_reward_max: 133.0
  episode_reward_mean: 43.32
  episode_reward_min: 11.0
  episodes_this_iter: 90
  episodes_total: 601
  experiment_id: 99df0008334f43779394474d46d27ce1
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.10000000149011612
          cur_lr: 4.999999873689376e-05
          entropy: 0.5446197986602783
          entropy_coeff: 0.0
          kl: 0.004509765654802322
          model: {}
          policy_loss: -0.005693237762898207
          total_loss: 206.11839294433594
          vf_explained_var: 0.1073232963681221
          vf_loss: 206.1236572265625
    num_agent_steps_sampled: 20000
    num_agent_steps_trained: 20000
    num_steps_sampled: 20000
    num_steps_trained: 20000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 5
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 69.06000000000002
    ram_util_percent: 86.75
  pid: 9044
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10193395594952362
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.11727529060495807
    mean_inference_ms: 1.7989170802559789
    mean_raw_obs_processing_ms: 0.18388387052508867
  time_since_restore: 43.44782853126526
  time_this_iter_s: 7.0150041580200195
  time_total_s: 43.44782853126526
  timers:
    learn_throughput: 957.628
    learn_time_ms: 4176.985
    load_throughput: 19953872.502
    load_time_ms: 0.2
    sample_throughput: 497.425
    sample_time_ms: 8041.409
    update_time_ms: 3.708
  timestamp: 1638395919
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 20000
  training_iteration: 5
  trial_id: aa22d_00000
  
Result for PPO_StatelessCartPole_aa22d_00000:
  agent_timesteps_total: 24000
  custom_metrics: {}
  date: 2021-12-01_22-58-48
  done: false
  episode_len_mean: 47.98
  episode_media: {}
  episode_reward_max: 159.0
  episode_reward_mean: 47.98
  episode_reward_min: 11.0
  episodes_this_iter: 81
  episodes_total: 682
  experiment_id: 99df0008334f43779394474d46d27ce1
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.05000000074505806
          cur_lr: 4.999999873689376e-05
          entropy: 0.5157366991043091
          entropy_coeff: 0.0
          kl: 0.002469780156388879
          model: {}
          policy_loss: -0.00011549380724318326
          total_loss: 223.69801330566406
          vf_explained_var: 0.14510144293308258
          vf_loss: 223.697998046875
    num_agent_steps_sampled: 24000
    num_agent_steps_trained: 24000
    num_steps_sampled: 24000
    num_steps_trained: 24000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 6
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 77.66666666666667
    ram_util_percent: 86.87499999999999
  pid: 9044
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10315175733432268
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.11747585144831862
    mean_inference_ms: 1.796743386961408
    mean_raw_obs_processing_ms: 0.18361450972900276
  time_since_restore: 52.08714461326599
  time_this_iter_s: 8.639316082000732
  time_total_s: 52.08714461326599
  timers:
    learn_throughput: 963.146
    learn_time_ms: 4153.057
    load_throughput: 23944647.003
    load_time_ms: 0.167
    sample_throughput: 497.199
    sample_time_ms: 8045.069
    update_time_ms: 5.866
  timestamp: 1638395928
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 24000
  training_iteration: 6
  trial_id: aa22d_00000
  
Result for PPO_StatelessCartPole_aa22d_00000:
  agent_timesteps_total: 28000
  custom_metrics: {}
  date: 2021-12-01_22-58-56
  done: false
  episode_len_mean: 50.24
  episode_media: {}
  episode_reward_max: 159.0
  episode_reward_mean: 50.24
  episode_reward_min: 11.0
  episodes_this_iter: 80
  episodes_total: 762
  experiment_id: 99df0008334f43779394474d46d27ce1
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.02500000037252903
          cur_lr: 4.999999873689376e-05
          entropy: 0.4738757014274597
          entropy_coeff: 0.0
          kl: 0.005073909182101488
          model: {}
          policy_loss: 0.00443687941879034
          total_loss: 240.7548065185547
          vf_explained_var: 0.14891892671585083
          vf_loss: 240.75022888183594
    num_agent_steps_sampled: 28000
    num_agent_steps_trained: 28000
    num_steps_sampled: 28000
    num_steps_trained: 28000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 7
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 72.13636363636364
    ram_util_percent: 86.82727272727271
  pid: 9044
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10310251236597129
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.11948908895760547
    mean_inference_ms: 1.781021410338414
    mean_raw_obs_processing_ms: 0.18258329446260468
  time_since_restore: 59.83448100090027
  time_this_iter_s: 7.747336387634277
  time_total_s: 59.83448100090027
  timers:
    learn_throughput: 983.057
    learn_time_ms: 4068.942
    load_throughput: 27935421.503
    load_time_ms: 0.143
    sample_throughput: 495.114
    sample_time_ms: 8078.951
    update_time_ms: 5.028
  timestamp: 1638395936
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 28000
  training_iteration: 7
  trial_id: aa22d_00000
  
Result for PPO_StatelessCartPole_aa22d_00000:
  agent_timesteps_total: 32000
  custom_metrics: {}
  date: 2021-12-01_22-59-04
  done: false
  episode_len_mean: 50.37
  episode_media: {}
  episode_reward_max: 155.0
  episode_reward_mean: 50.37
  episode_reward_min: 9.0
  episodes_this_iter: 81
  episodes_total: 843
  experiment_id: 99df0008334f43779394474d46d27ce1
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.02500000037252903
          cur_lr: 4.999999873689376e-05
          entropy: 0.44857272505760193
          entropy_coeff: 0.0
          kl: 0.005331501364707947
          model: {}
          policy_loss: -0.00537552684545517
          total_loss: 236.2506103515625
          vf_explained_var: 0.16449585556983948
          vf_loss: 236.25584411621094
    num_agent_steps_sampled: 32000
    num_agent_steps_trained: 32000
    num_steps_sampled: 32000
    num_steps_trained: 32000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 8
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 73.15454545454546
    ram_util_percent: 86.82727272727271
  pid: 9044
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10669802327244614
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.11682882882097279
    mean_inference_ms: 1.7686794144275058
    mean_raw_obs_processing_ms: 0.1813271875273223
  time_since_restore: 67.72298955917358
  time_this_iter_s: 7.888508558273315
  time_total_s: 67.72298955917358
  timers:
    learn_throughput: 996.781
    learn_time_ms: 4012.918
    load_throughput: 31926196.004
    load_time_ms: 0.125
    sample_throughput: 496.729
    sample_time_ms: 8052.687
    update_time_ms: 4.527
  timestamp: 1638395944
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 32000
  training_iteration: 8
  trial_id: aa22d_00000
  
Result for PPO_StatelessCartPole_aa22d_00000:
  agent_timesteps_total: 36000
  custom_metrics: {}
  date: 2021-12-01_22-59-12
  done: false
  episode_len_mean: 49.87
  episode_media: {}
  episode_reward_max: 110.0
  episode_reward_mean: 49.87
  episode_reward_min: 11.0
  episodes_this_iter: 81
  episodes_total: 924
  experiment_id: 99df0008334f43779394474d46d27ce1
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.02500000037252903
          cur_lr: 4.999999873689376e-05
          entropy: 0.42752259969711304
          entropy_coeff: 0.0
          kl: 0.005028429441154003
          model: {}
          policy_loss: -0.0017633900279179215
          total_loss: 193.06703186035156
          vf_explained_var: 0.2048284411430359
          vf_loss: 193.06866455078125
    num_agent_steps_sampled: 36000
    num_agent_steps_trained: 36000
    num_steps_sampled: 36000
    num_steps_trained: 36000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 9
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 75.8
    ram_util_percent: 86.93333333333334
  pid: 9044
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10545672991997837
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.11722357639711418
    mean_inference_ms: 1.7522968685832132
    mean_raw_obs_processing_ms: 0.17896026200224338
  time_since_restore: 75.84080076217651
  time_this_iter_s: 8.11781120300293
  time_total_s: 75.84080076217651
  timers:
    learn_throughput: 994.383
    learn_time_ms: 4022.594
    load_throughput: 35916970.504
    load_time_ms: 0.111
    sample_throughput: 499.317
    sample_time_ms: 8010.943
    update_time_ms: 4.024
  timestamp: 1638395952
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 36000
  training_iteration: 9
  trial_id: aa22d_00000
  
Result for PPO_StatelessCartPole_aa22d_00000:
  agent_timesteps_total: 40000
  custom_metrics: {}
  date: 2021-12-01_22-59-19
  done: true
  episode_len_mean: 46.75
  episode_media: {}
  episode_reward_max: 125.0
  episode_reward_mean: 46.75
  episode_reward_min: 13.0
  episodes_this_iter: 84
  episodes_total: 1008
  experiment_id: 99df0008334f43779394474d46d27ce1
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.02500000037252903
          cur_lr: 4.999999873689376e-05
          entropy: 0.45696982741355896
          entropy_coeff: 0.0
          kl: 0.0024534217081964016
          model: {}
          policy_loss: 0.0026924554258584976
          total_loss: 184.4345245361328
          vf_explained_var: 0.2629404664039612
          vf_loss: 184.43174743652344
    num_agent_steps_sampled: 40000
    num_agent_steps_trained: 40000
    num_steps_sampled: 40000
    num_steps_trained: 40000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 10
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 70.31
    ram_util_percent: 86.9
  pid: 9044
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10223809426816181
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.11759765514824477
    mean_inference_ms: 1.7250900206764819
    mean_raw_obs_processing_ms: 0.17782050917712555
  time_since_restore: 83.26664853096008
  time_this_iter_s: 7.425847768783569
  time_total_s: 83.26664853096008
  timers:
    learn_throughput: 1002.298
    learn_time_ms: 3990.829
    load_throughput: 7987248.75
    load_time_ms: 0.501
    sample_throughput: 500.117
    sample_time_ms: 7998.131
    update_time_ms: 3.621
  timestamp: 1638395959
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 40000
  training_iteration: 10
  trial_id: aa22d_00000
  
Option 2: Training finished successfully

== Status ==
Current time: 2021-12-01 22:58:08 (running for 00:00:44.63)
Memory usage on this node: 10.5/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_aa22d_00000	RUNNING	127.0.0.1:9044	1	10.051	4000	22.4463	85	8	22.4463

== Status ==
Current time: 2021-12-01 22:58:13 (running for 00:00:49.73)
Memory usage on this node: 10.5/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_aa22d_00000	RUNNING	127.0.0.1:9044	1	10.051	4000	22.4463	85	8	22.4463

== Status ==
Current time: 2021-12-01 22:58:18 (running for 00:00:55.09)
Memory usage on this node: 10.5/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_aa22d_00000	RUNNING	127.0.0.1:9044	2	17.4439	8000	30.0833	106	8	30.0833

== Status ==
Current time: 2021-12-01 22:58:23 (running for 00:01:00.16)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_aa22d_00000	RUNNING	127.0.0.1:9044	2	17.4439	8000	30.0833	106	8	30.0833

== Status ==
Current time: 2021-12-01 22:58:29 (running for 00:01:05.71)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_aa22d_00000	RUNNING	127.0.0.1:9044	3	28.0113	12000	37.3148	143	9	37.3148

== Status ==
Current time: 2021-12-01 22:58:34 (running for 00:01:11.09)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_aa22d_00000	RUNNING	127.0.0.1:9044	4	36.4328	16000	42.79	152	10	42.79

== Status ==
Current time: 2021-12-01 22:58:39 (running for 00:01:16.14)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_aa22d_00000	RUNNING	127.0.0.1:9044	5	43.4478	20000	43.32	133	11	43.32

== Status ==
Current time: 2021-12-01 22:58:44 (running for 00:01:21.20)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_aa22d_00000	RUNNING	127.0.0.1:9044	5	43.4478	20000	43.32	133	11	43.32

== Status ==
Current time: 2021-12-01 22:58:50 (running for 00:01:26.88)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_aa22d_00000	RUNNING	127.0.0.1:9044	6	52.0871	24000	47.98	159	11	47.98

== Status ==
Current time: 2021-12-01 22:58:55 (running for 00:01:31.95)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_aa22d_00000	RUNNING	127.0.0.1:9044	6	52.0871	24000	47.98	159	11	47.98

== Status ==
Current time: 2021-12-01 22:59:01 (running for 00:01:37.71)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_aa22d_00000	RUNNING	127.0.0.1:9044	7	59.8345	28000	50.24	159	11	50.24

== Status ==
Current time: 2021-12-01 22:59:07 (running for 00:01:43.60)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_aa22d_00000	RUNNING	127.0.0.1:9044	8	67.723	32000	50.37	155	9	50.37

== Status ==
Current time: 2021-12-01 22:59:12 (running for 00:01:48.70)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_aa22d_00000	RUNNING	127.0.0.1:9044	8	67.723	32000	50.37	155	9	50.37

== Status ==
Current time: 2021-12-01 22:59:17 (running for 00:01:53.79)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_aa22d_00000	RUNNING	127.0.0.1:9044	9	75.8408	36000	49.87	110	11	49.87

== Status ==
Current time: 2021-12-01 22:59:19 (running for 00:01:56.27)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 TERMINATED)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_aa22d_00000	TERMINATED	127.0.0.1:9044	10	83.2666	40000	46.75	125	13	46.75

print_reward(results2)

Reward after 10 training iterations: 46.75

plot_rewards(results2)

# compare learning curves
plot_learning(results1, label="1: Full Observations")
plot_learning(results2, label="2: Partial Observations")

With only partial observations, i.e., without observing velocity, the RL agent does not learn a useful policy. The mean episode reward plateaus at around 50 within the 10 training iterations, and the resulting episode reward is much lower than with full observations.

Option 3: Use Sequence of Last Observations

Even if the velocities of cart and pole are not explicitly available in this example, the RL agent can derive them from a sequence of previous observations. If the cart stays at the same position, its velocity is likely close to zero. If its position changes a lot between observations, its velocity is likely high.

Hence, one useful approach is to simply stack the last \(n\) observations and provide this sequence as input to the RL agent.
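
To make this intuition concrete, here is a minimal sketch (not part of the experiments in this post) of how a velocity could be approximated from consecutive positions via finite differences. The helper name `estimate_velocity` and the example positions are made up for illustration, and the time step of 0.02 s is an assumption meant to mirror CartPole's default step size. The RL agent is never given such a formula; it would have to learn something similar from the stacked observations.

```python
import numpy as np

TAU = 0.02  # assumed time between two observations in seconds


def estimate_velocity(positions, dt=TAU):
    """Approximate velocities from consecutive positions via finite differences."""
    positions = np.asarray(positions, dtype=float)
    # difference between neighboring positions, divided by the elapsed time
    return np.diff(positions) / dt


# cart stays at the same position --> estimated velocity is zero
still = estimate_velocity([0.10, 0.10, 0.10, 0.10])
# cart position grows by 0.02 per step --> estimated velocity is about 1.0
moving = estimate_velocity([0.10, 0.12, 0.14, 0.16])
```

Note that \(n\) positions only yield \(n-1\) velocity estimates, which is one reason why a single observation is not enough.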

Option 3a: Use Raw Sequence as Input

Here, I consider the same default feed-forward neural network with PPO, just providing the stacked, partial observations as input.

Stacking Observations Using Gym’s `FrameStack` Wrapper

To stack the last \(n\) observations, I use Gym’s FrameStack wrapper. As an example, I choose \(n=4\).

from gym.wrappers import FrameStack

NUM_FRAMES = 4

# stateless CartPole --> only 2 observations: position of cart & angle of pole (not: velocity of cart or pole)
env = StatelessCartPole()
print(f"Shape of observation space (stateless CartPole): {env.observation_space.shape}")

# stack last n observations into sequence --> n x 2
env_stacked = FrameStack(env, NUM_FRAMES)
print(f"Shape of observation space (stacked stateless CartPole): {env_stacked.observation_space.shape}")

# register env for RLlib
registry.register_env("StackedStatelessCartPole", lambda _: FrameStack(StatelessCartPole(), NUM_FRAMES))

Shape of observation space (stateless CartPole): (2,)
Shape of observation space (stacked stateless CartPole): (4, 2)
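
For intuition, the stacking itself can be sketched with a fixed-length buffer. The class `SimpleFrameStack` below is a hypothetical stand-in written for this post, not Gym's actual implementation; in particular, filling the buffer with copies of the first observation at episode start is an assumption of this sketch.

```python
from collections import deque

import numpy as np


class SimpleFrameStack:
    """Hypothetical helper: keeps the last n observations as one (n, obs_dim) array."""

    def __init__(self, num_frames):
        self.num_frames = num_frames
        # a deque with maxlen drops the oldest observation automatically
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_obs):
        # assumption: pad the buffer with copies of the first observation
        for _ in range(self.num_frames):
            self.frames.append(np.asarray(first_obs, dtype=np.float32))
        return self.stacked()

    def step(self, obs):
        # append the newest observation; the oldest one falls out of the buffer
        self.frames.append(np.asarray(obs, dtype=np.float32))
        return self.stacked()

    def stacked(self):
        # oldest observation first, newest observation last
        return np.stack(list(self.frames))


stack = SimpleFrameStack(num_frames=4)
first = stack.reset([0.0, 0.01])   # made-up (cart position, pole angle)
nxt = stack.step([0.02, 0.03])     # shape stays (4, 2)
flat = nxt.reshape(-1)             # 8 values, e.g., as input for an MLP
```

The resulting `(4, 2)` array matches the shape of the stacked observation space printed above. Flattening it into 8 values, as in the last line, is one simple way a feed-forward network could consume the sequence.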

# use PPO with vanilla MLP
config3a = ppo.DEFAULT_CONFIG.copy()
config3a["env"] = "StackedStatelessCartPole"
# train; this takes a while
results3a = ray.tune.run("PPO", config=config3a, stop=stop)
print("Option 3a with FrameStack: Training finished successfully")

== Status ==
Current time: 2021-12-01 23:02:44 (running for 00:00:00.15)
Memory usage on this node: 9.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 PENDING)

Trial name	status	loc
PPO_StackedStatelessCartPole_69565_00000	PENDING

(pid=None) c:\users\stefan\git-repos\private\blog\venv\lib\site-packages\redis\connection.py:77: UserWarning: redis-py works best with hiredis. Please consider installing
(pid=None)   warnings.warn(msg)
(pid=13456) 2021-12-01 23:03:00,839 INFO trainer.py:753 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
(pid=13456) 2021-12-01 23:03:00,839 INFO ppo.py:166 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(pid=13456) 2021-12-01 23:03:00,839 INFO trainer.py:770 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=None) c:\users\stefan\git-repos\private\blog\venv\lib\site-packages\redis\connection.py:77: UserWarning: redis-py works best with hiredis. Please consider installing
(pid=None)   warnings.warn(msg)
(pid=None) c:\users\stefan\git-repos\private\blog\venv\lib\site-packages\redis\connection.py:77: UserWarning: redis-py works best with hiredis. Please consider installing
(pid=None)   warnings.warn(msg)
(pid=19996) c:\users\stefan\git-repos\private\blog\venv\lib\site-packages\gym\spaces\box.py:142: UserWarning: WARN: Casting input x to numpy array.
(pid=19996)   logger.warn("Casting input x to numpy array.")
(pid=3484) c:\users\stefan\git-repos\private\blog\venv\lib\site-packages\gym\spaces\box.py:142: UserWarning: WARN: Casting input x to numpy array.
(pid=3484)   logger.warn("Casting input x to numpy array.")
(pid=3484) 2021-12-01 23:03:17,245  WARNING deprecation.py:38 -- DeprecationWarning: `SampleBatch['is_training']` has been deprecated. Use `SampleBatch.is_training` instead. This will raise an error in the future!
(pid=13456) 2021-12-01 23:03:19,489 WARNING deprecation.py:38 -- DeprecationWarning: `SampleBatch['is_training']` has been deprecated. Use `SampleBatch.is_training` instead. This will raise an error in the future!
(pid=13456) 2021-12-01 23:03:20,834 WARNING trainer_template.py:185 -- `execution_plan` functions should accept `trainer`, `workers`, and `config` as args!
(pid=13456) 2021-12-01 23:03:20,834 INFO trainable.py:110 -- Trainable.setup took 19.995 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(pid=13456) 2021-12-01 23:03:20,839 WARNING util.py:57 -- Install gputil for GPU system monitoring.
(pid=13456) 2021-12-01 23:03:26,389 WARNING deprecation.py:38 -- DeprecationWarning: `slice` has been deprecated. Use `SampleBatch[start:stop]` instead. This will raise an error in the future!
2021-12-01 23:04:57,071 INFO tune.py:630 -- Total run time: 132.65 seconds (132.43 seconds for the tuning loop).

(pid=13456) Windows fatal exception: access violation
(pid=13456) 
(pid=19996) [2021-12-01 23:04:56,970 E 19996 3460] raylet_client.cc:159: IOError: Unknown error [RayletClient] Failed to disconnect from raylet.
(pid=19996) Windows fatal exception: access violation
(pid=19996) 
(pid=3484) [2021-12-01 23:04:56,971 E 3484 17628] raylet_client.cc:159: IOError: Unknown error [RayletClient] Failed to disconnect from raylet.
(pid=3484) Windows fatal exception: access violation
(pid=3484)

== Status ==
Current time: 2021-12-01 23:02:49 (running for 00:00:05.15)
Memory usage on this node: 9.4/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 PENDING)

Trial name	status	loc
PPO_StackedStatelessCartPole_69565_00000	PENDING

== Status ==
Current time: 2021-12-01 23:03:20 (running for 00:00:36.40)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StackedStatelessCartPole_69565_00000	RUNNING	127.0.0.1:13456

== Status ==
Current time: 2021-12-01 23:03:21 (running for 00:00:37.50)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StackedStatelessCartPole_69565_00000	RUNNING	127.0.0.1:13456

== Status ==
Current time: 2021-12-01 23:03:28 (running for 00:00:43.61)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StackedStatelessCartPole_69565_00000	RUNNING	127.0.0.1:13456

Result for PPO_StackedStatelessCartPole_69565_00000:
  agent_timesteps_total: 4000
  custom_metrics: {}
  date: 2021-12-01_23-03-31
  done: false
  episode_len_mean: 20.91578947368421
  episode_media: {}
  episode_reward_max: 61.0
  episode_reward_mean: 20.91578947368421
  episode_reward_min: 8.0
  episodes_this_iter: 190
  episodes_total: 190
  experiment_id: dad4489332ba46c8ab9c9ed834879afb
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6832552552223206
          entropy_coeff: 0.0
          kl: 0.010132171213626862
          model: {}
          policy_loss: -0.017918335273861885
          total_loss: 126.10237121582031
          vf_explained_var: 0.01804439164698124
          vf_loss: 126.1182632446289
    num_agent_steps_sampled: 4000
    num_agent_steps_trained: 4000
    num_steps_sampled: 4000
    num_steps_trained: 4000
  iterations_since_restore: 1
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 80.80666666666666
    ram_util_percent: 85.86666666666669
  pid: 13456
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.11120850849435063
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.18972639138395245
    mean_inference_ms: 2.0335766275739453
    mean_raw_obs_processing_ms: 0.3751123260983276
  time_since_restore: 10.363371133804321
  time_this_iter_s: 10.363371133804321
  time_total_s: 10.363371133804321
  timers:
    learn_throughput: 833.897
    learn_time_ms: 4796.753
    load_throughput: 0.0
    load_time_ms: 0.0
    sample_throughput: 719.056
    sample_time_ms: 5562.852
    update_time_ms: 5.016
  timestamp: 1638396211
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 4000
  training_iteration: 1
  trial_id: '69565_00000'
  
Result for PPO_StackedStatelessCartPole_69565_00000:
  agent_timesteps_total: 8000
  custom_metrics: {}
  date: 2021-12-01_23-03-40
  done: false
  episode_len_mean: 29.455882352941178
  episode_media: {}
  episode_reward_max: 136.0
  episode_reward_mean: 29.455882352941178
  episode_reward_min: 8.0
  episodes_this_iter: 136
  episodes_total: 326
  experiment_id: dad4489332ba46c8ab9c9ed834879afb
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6507855653762817
          entropy_coeff: 0.0
          kl: 0.010778849013149738
          model: {}
          policy_loss: -0.01948031783103943
          total_loss: 162.9302215576172
          vf_explained_var: 0.03349286690354347
          vf_loss: 162.94754028320312
    num_agent_steps_sampled: 8000
    num_agent_steps_trained: 8000
    num_steps_sampled: 8000
    num_steps_trained: 8000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 2
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 84.03076923076924
    ram_util_percent: 85.8076923076923
  pid: 13456
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10394749777843926
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.15974853905888772
    mean_inference_ms: 1.9997662326250156
    mean_raw_obs_processing_ms: 0.3263675950258694
  time_since_restore: 20.015674591064453
  time_this_iter_s: 9.652303457260132
  time_total_s: 20.015674591064453
  timers:
    learn_throughput: 849.072
    learn_time_ms: 4711.025
    load_throughput: 7966389.364
    load_time_ms: 0.502
    sample_throughput: 518.103
    sample_time_ms: 7720.474
    update_time_ms: 4.509
  timestamp: 1638396220
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 8000
  training_iteration: 2
  trial_id: '69565_00000'
  
Result for PPO_StackedStatelessCartPole_69565_00000:
  agent_timesteps_total: 12000
  custom_metrics: {}
  date: 2021-12-01_23-03-49
  done: false
  episode_len_mean: 45.47
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 45.47
  episode_reward_min: 9.0
  episodes_this_iter: 83
  episodes_total: 409
  experiment_id: dad4489332ba46c8ab9c9ed834879afb
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6116749048233032
          entropy_coeff: 0.0
          kl: 0.0092014130204916
          model: {}
          policy_loss: -0.017703521996736526
          total_loss: 372.5731201171875
          vf_explained_var: 0.04724571481347084
          vf_loss: 372.5889892578125
    num_agent_steps_sampled: 12000
    num_agent_steps_trained: 12000
    num_steps_sampled: 12000
    num_steps_trained: 12000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 3
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 79.78333333333333
    ram_util_percent: 85.93333333333334
  pid: 13456
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.09840720456414369
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.15510985166002864
    mean_inference_ms: 1.9296228234529118
    mean_raw_obs_processing_ms: 0.29812654395886573
  time_since_restore: 28.900269746780396
  time_this_iter_s: 8.884595155715942
  time_total_s: 28.900269746780396
  timers:
    learn_throughput: 868.178
    learn_time_ms: 4607.352
    load_throughput: 5989010.947
    load_time_ms: 0.668
    sample_throughput: 488.004
    sample_time_ms: 8196.646
    update_time_ms: 4.004
  timestamp: 1638396229
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 12000
  training_iteration: 3
  trial_id: '69565_00000'
  
Result for PPO_StackedStatelessCartPole_69565_00000:
  agent_timesteps_total: 16000
  custom_metrics: {}
  date: 2021-12-01_23-04-00
  done: false
  episode_len_mean: 63.03
  episode_media: {}
  episode_reward_max: 272.0
  episode_reward_mean: 63.03
  episode_reward_min: 13.0
  episodes_this_iter: 51
  episodes_total: 460
  experiment_id: dad4489332ba46c8ab9c9ed834879afb
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.5801236629486084
          entropy_coeff: 0.0
          kl: 0.006844064686447382
          model: {}
          policy_loss: -0.009995924308896065
          total_loss: 404.9743957519531
          vf_explained_var: 0.09591271728277206
          vf_loss: 404.9830627441406
    num_agent_steps_sampled: 16000
    num_agent_steps_trained: 16000
    num_steps_sampled: 16000
    num_steps_trained: 16000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 4
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 88.04
    ram_util_percent: 85.95333333333335
  pid: 13456
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.0999835854110982
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.15354213307036435
    mean_inference_ms: 1.9525573766667057
    mean_raw_obs_processing_ms: 0.29008862922916356
  time_since_restore: 39.33686113357544
  time_this_iter_s: 10.436591386795044
  time_total_s: 39.33686113357544
  timers:
    learn_throughput: 860.917
    learn_time_ms: 4646.208
    load_throughput: 7985347.93
    load_time_ms: 0.501
    sample_throughput: 461.014
    sample_time_ms: 8676.526
    update_time_ms: 3.003
  timestamp: 1638396240
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 16000
  training_iteration: 4
  trial_id: '69565_00000'
  
Result for PPO_StackedStatelessCartPole_69565_00000:
  agent_timesteps_total: 20000
  custom_metrics: {}
  date: 2021-12-01_23-04-09
  done: false
  episode_len_mean: 83.03
  episode_media: {}
  episode_reward_max: 272.0
  episode_reward_mean: 83.03
  episode_reward_min: 10.0
  episodes_this_iter: 41
  episodes_total: 501
  experiment_id: dad4489332ba46c8ab9c9ed834879afb
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.5570896863937378
          entropy_coeff: 0.0
          kl: 0.00568711943924427
          model: {}
          policy_loss: -0.01232148241251707
          total_loss: 408.25262451171875
          vf_explained_var: 0.11371473968029022
          vf_loss: 408.26385498046875
    num_agent_steps_sampled: 20000
    num_agent_steps_trained: 20000
    num_steps_sampled: 20000
    num_steps_trained: 20000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 5
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 79.44615384615385
    ram_util_percent: 85.80769230769229
  pid: 13456
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10660624354014989
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.149048118877094
    mean_inference_ms: 1.9695397741116307
    mean_raw_obs_processing_ms: 0.28367179033707196
  time_since_restore: 48.41154980659485
  time_this_iter_s: 9.07468867301941
  time_total_s: 48.41154980659485
  timers:
    learn_throughput: 873.938
    learn_time_ms: 4576.986
    load_throughput: 9981684.912
    load_time_ms: 0.401
    sample_throughput: 451.554
    sample_time_ms: 8858.292
    update_time_ms: 2.403
  timestamp: 1638396249
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 20000
  training_iteration: 5
  trial_id: '69565_00000'
  
Result for PPO_StackedStatelessCartPole_69565_00000:
  agent_timesteps_total: 24000
  custom_metrics: {}
  date: 2021-12-01_23-04-19
  done: false
  episode_len_mean: 102.34
  episode_media: {}
  episode_reward_max: 304.0
  episode_reward_mean: 102.34
  episode_reward_min: 10.0
  episodes_this_iter: 24
  episodes_total: 525
  experiment_id: dad4489332ba46c8ab9c9ed834879afb
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.5555164217948914
          entropy_coeff: 0.0
          kl: 0.004422472789883614
          model: {}
          policy_loss: -0.008869567885994911
          total_loss: 570.493896484375
          vf_explained_var: 0.24427081644535065
          vf_loss: 570.5018920898438
    num_agent_steps_sampled: 24000
    num_agent_steps_trained: 24000
    num_steps_sampled: 24000
    num_steps_trained: 24000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 6
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 85.06923076923076
    ram_util_percent: 85.82307692307693
  pid: 13456
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10851443555266879
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.14933858568901648
    mean_inference_ms: 1.9681416365815767
    mean_raw_obs_processing_ms: 0.28052371436443274
  time_since_restore: 57.97208309173584
  time_this_iter_s: 9.560533285140991
  time_total_s: 57.97208309173584
  timers:
    learn_throughput: 878.415
    learn_time_ms: 4553.655
    load_throughput: 11978021.894
    load_time_ms: 0.334
    sample_throughput: 446.643
    sample_time_ms: 8955.693
    update_time_ms: 2.669
  timestamp: 1638396259
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 24000
  training_iteration: 6
  trial_id: '69565_00000'
  
Result for PPO_StackedStatelessCartPole_69565_00000:
  agent_timesteps_total: 28000
  custom_metrics: {}
  date: 2021-12-01_23-04-27
  done: false
  episode_len_mean: 127.8
  episode_media: {}
  episode_reward_max: 321.0
  episode_reward_mean: 127.8
  episode_reward_min: 10.0
  episodes_this_iter: 23
  episodes_total: 548
  experiment_id: dad4489332ba46c8ab9c9ed834879afb
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.10000000149011612
          cur_lr: 4.999999873689376e-05
          entropy: 0.5434969067573547
          entropy_coeff: 0.0
          kl: 0.008256432600319386
          model: {}
          policy_loss: -0.0062043326906859875
          total_loss: 453.47607421875
          vf_explained_var: 0.3077850043773651
          vf_loss: 453.4814758300781
    num_agent_steps_sampled: 28000
    num_agent_steps_trained: 28000
    num_steps_sampled: 28000
    num_steps_trained: 28000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 7
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 79.55833333333334
    ram_util_percent: 85.89166666666667
  pid: 13456
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.110027565885569
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.14780355323603098
    mean_inference_ms: 1.9526956031732303
    mean_raw_obs_processing_ms: 0.27615972972533087
  time_since_restore: 66.65037989616394
  time_this_iter_s: 8.6782968044281
  time_total_s: 66.65037989616394
  timers:
    learn_throughput: 887.707
    learn_time_ms: 4505.99
    load_throughput: 13974358.877
    load_time_ms: 0.286
    sample_throughput: 446.514
    sample_time_ms: 8958.288
    update_time_ms: 2.859
  timestamp: 1638396267
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 28000
  training_iteration: 7
  trial_id: '69565_00000'
  
Result for PPO_StackedStatelessCartPole_69565_00000:
  agent_timesteps_total: 32000
  custom_metrics: {}
  date: 2021-12-01_23-04-38
  done: false
  episode_len_mean: 145.51
  episode_media: {}
  episode_reward_max: 392.0
  episode_reward_mean: 145.51
  episode_reward_min: 10.0
  episodes_this_iter: 26
  episodes_total: 574
  experiment_id: dad4489332ba46c8ab9c9ed834879afb
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.10000000149011612
          cur_lr: 4.999999873689376e-05
          entropy: 0.5463652014732361
          entropy_coeff: 0.0
          kl: 0.010875530540943146
          model: {}
          policy_loss: -0.007964679040014744
          total_loss: 391.5842590332031
          vf_explained_var: 0.35197633504867554
          vf_loss: 391.5911865234375
    num_agent_steps_sampled: 32000
    num_agent_steps_trained: 32000
    num_steps_sampled: 32000
    num_steps_trained: 32000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 8
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 87.83571428571429
    ram_util_percent: 85.70000000000002
  pid: 13456
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.11137697557468901
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.14710154952868693
    mean_inference_ms: 1.9442323273031195
    mean_raw_obs_processing_ms: 0.27244073066886854
  time_since_restore: 76.93640422821045
  time_this_iter_s: 10.286024332046509
  time_total_s: 76.93640422821045
  timers:
    learn_throughput: 875.973
    learn_time_ms: 4566.348
    load_throughput: 15970695.859
    load_time_ms: 0.25
    sample_throughput: 442.806
    sample_time_ms: 9033.302
    update_time_ms: 2.879
  timestamp: 1638396278
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 32000
  training_iteration: 8
  trial_id: '69565_00000'
  
Result for PPO_StackedStatelessCartPole_69565_00000:
  agent_timesteps_total: 36000
  custom_metrics: {}
  date: 2021-12-01_23-04-48
  done: false
  episode_len_mean: 164.14
  episode_media: {}
  episode_reward_max: 392.0
  episode_reward_mean: 164.14
  episode_reward_min: 13.0
  episodes_this_iter: 20
  episodes_total: 594
  experiment_id: dad4489332ba46c8ab9c9ed834879afb
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.10000000149011612
          cur_lr: 4.999999873689376e-05
          entropy: 0.5408887267112732
          entropy_coeff: 0.0
          kl: 0.011410081759095192
          model: {}
          policy_loss: -0.013954582624137402
          total_loss: 419.84454345703125
          vf_explained_var: 0.34064534306526184
          vf_loss: 419.8573913574219
    num_agent_steps_sampled: 36000
    num_agent_steps_trained: 36000
    num_steps_sampled: 36000
    num_steps_trained: 36000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 9
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 84.79285714285716
    ram_util_percent: 85.70000000000003
  pid: 13456
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.1111877406329047
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.14667458354156745
    mean_inference_ms: 1.9476374535757386
    mean_raw_obs_processing_ms: 0.2706214096819341
  time_since_restore: 86.94219470024109
  time_this_iter_s: 10.00579047203064
  time_total_s: 86.94219470024109
  timers:
    learn_throughput: 881.266
    learn_time_ms: 4538.926
    load_throughput: 5140953.457
    load_time_ms: 0.778
    sample_throughput: 433.901
    sample_time_ms: 9218.702
    update_time_ms: 2.559
  timestamp: 1638396288
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 36000
  training_iteration: 9
  trial_id: '69565_00000'
  
Result for PPO_StackedStatelessCartPole_69565_00000:
  agent_timesteps_total: 40000
  custom_metrics: {}
  date: 2021-12-01_23-04-56
  done: true
  episode_len_mean: 176.48
  episode_media: {}
  episode_reward_max: 392.0
  episode_reward_mean: 176.48
  episode_reward_min: 13.0
  episodes_this_iter: 19
  episodes_total: 613
  experiment_id: dad4489332ba46c8ab9c9ed834879afb
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.10000000149011612
          cur_lr: 4.999999873689376e-05
          entropy: 0.5288471579551697
          entropy_coeff: 0.0
          kl: 0.008251729421317577
          model: {}
          policy_loss: -0.007646023295819759
          total_loss: 273.2549743652344
          vf_explained_var: 0.5383354425430298
          vf_loss: 273.2617492675781
    num_agent_steps_sampled: 40000
    num_agent_steps_trained: 40000
    num_steps_sampled: 40000
    num_steps_trained: 40000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 10
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 80.93333333333332
    ram_util_percent: 86.14999999999999
  pid: 13456
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.11016966943400075
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.14657990293613932
    mean_inference_ms: 1.948797938913841
    mean_raw_obs_processing_ms: 0.2673978140474405
  time_since_restore: 95.56249117851257
  time_this_iter_s: 8.620296478271484
  time_total_s: 95.56249117851257
  timers:
    learn_throughput: 893.185
    learn_time_ms: 4478.354
    load_throughput: 5712170.508
    load_time_ms: 0.7
    sample_throughput: 434.433
    sample_time_ms: 9207.403
    update_time_ms: 2.303
  timestamp: 1638396296
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 40000
  training_iteration: 10
  trial_id: '69565_00000'
  
Option 3a with FrameStack: Training finished successfully

== Status ==
Current time: 2021-12-01 23:03:33 (running for 00:00:48.84)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_69565_00000	RUNNING	127.0.0.1:13456	1	10.3634	4000	20.9158	61	8	20.9158

== Status ==
Current time: 2021-12-01 23:03:38 (running for 00:00:53.93)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_69565_00000	RUNNING	127.0.0.1:13456	1	10.3634	4000	20.9158	61	8	20.9158

== Status ==
Current time: 2021-12-01 23:03:44 (running for 00:00:59.57)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_69565_00000	RUNNING	127.0.0.1:13456	2	20.0157	8000	29.4559	136	8	29.4559

== Status ==
Current time: 2021-12-01 23:03:49 (running for 00:01:04.66)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_69565_00000	RUNNING	127.0.0.1:13456	2	20.0157	8000	29.4559	136	8	29.4559

== Status ==
Current time: 2021-12-01 23:03:55 (running for 00:01:10.55)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_69565_00000	RUNNING	127.0.0.1:13456	3	28.9003	12000	45.47	200	9	45.47

== Status ==
Current time: 2021-12-01 23:04:00 (running for 00:01:15.68)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_69565_00000	RUNNING	127.0.0.1:13456	3	28.9003	12000	45.47	200	9	45.47

== Status ==
Current time: 2021-12-01 23:04:05 (running for 00:01:21.06)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_69565_00000	RUNNING	127.0.0.1:13456	4	39.3369	16000	63.03	272	13	63.03

== Status ==
Current time: 2021-12-01 23:04:11 (running for 00:01:27.06)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_69565_00000	RUNNING	127.0.0.1:13456	5	48.4115	20000	83.03	272	10	83.03

== Status ==
Current time: 2021-12-01 23:04:16 (running for 00:01:32.16)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_69565_00000	RUNNING	127.0.0.1:13456	5	48.4115	20000	83.03	272	10	83.03

== Status ==
Current time: 2021-12-01 23:04:22 (running for 00:01:37.68)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_69565_00000	RUNNING	127.0.0.1:13456	6	57.9721	24000	102.34	304	10	102.34

== Status ==
Current time: 2021-12-01 23:04:27 (running for 00:01:42.89)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_69565_00000	RUNNING	127.0.0.1:13456	6	57.9721	24000	102.34	304	10	102.34

== Status ==
Current time: 2021-12-01 23:04:32 (running for 00:01:48.46)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_69565_00000	RUNNING	127.0.0.1:13456	7	66.6504	28000	127.8	321	10	127.8

== Status ==
Current time: 2021-12-01 23:04:38 (running for 00:01:53.54)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_69565_00000	RUNNING	127.0.0.1:13456	7	66.6504	28000	127.8	321	10	127.8

== Status ==
Current time: 2021-12-01 23:04:43 (running for 00:01:58.76)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_69565_00000	RUNNING	127.0.0.1:13456	8	76.9364	32000	145.51	392	10	145.51

== Status ==
Current time: 2021-12-01 23:04:49 (running for 00:02:04.76)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_69565_00000	RUNNING	127.0.0.1:13456	9	86.9422	36000	164.14	392	13	164.14

== Status ==
Current time: 2021-12-01 23:04:56 (running for 00:02:11.83)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_69565_00000	RUNNING	127.0.0.1:13456	9	86.9422	36000	164.14	392	13	164.14

== Status ==
Current time: 2021-12-01 23:04:56 (running for 00:02:12.48)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 TERMINATED)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_69565_00000	TERMINATED	127.0.0.1:13456	10	95.5625	40000	176.48	392	13	176.48

print_reward(results3a)

Reward after 10 training iterations: 176.48

plot_rewards(results3a)

plot_learning(results1, label="1: Full Observations")
plot_learning(results2, label="2: Partial Observations")
plot_learning(results3a, label="3a: Stacked, Partial Observations")

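Simply by stacking the last \(n\) observations, the RL agent learns a useful policy again, even though each individual observation is still partial, i.e., missing the cart velocity and the pole angular velocity. The stacked sequence makes this possible because the velocities can be inferred from how the cart position and pole angle change between consecutive observations.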

As the learning curves show, the agent with stacked observations learns somewhat more slowly than with full observations but still much faster than the agent with only a single partial observation (which hardly learns at all). After 10 training iterations, it reaches a mean episode reward of 176.48.
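To make the intuition concrete, here is a minimal sketch of why a stack of partial observations contains the missing information. It is not the code used for training above: `ObservationStack` and `estimate_velocities` are hypothetical helpers, the observations are made-up (cart position, pole angle) pairs, and the step length of 0.02 seconds is assumed to match Gym's CartPole. In practice, the agent's neural network has to learn such a relationship from the stacked input by itself.

```python
from collections import deque

TAU = 0.02  # assumed time between two steps in seconds (the value used by Gym's CartPole)


class ObservationStack:
    """Hypothetical helper that keeps the last `num_frames` partial observations."""

    def __init__(self, num_frames):
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_obs):
        # Fill the whole stack with the first observation so its size is fixed from step 0
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(tuple(first_obs))

    def push(self, obs):
        # Appending to a full deque drops the oldest observation automatically
        self.frames.append(tuple(obs))

    def stacked(self):
        return list(self.frames)


def estimate_velocities(stacked, tau=TAU):
    """Finite-difference estimate of cart velocity and pole angular velocity."""
    (x_prev, theta_prev), (x_last, theta_last) = stacked[-2], stacked[-1]
    return (x_last - x_prev) / tau, (theta_last - theta_prev) / tau


# Made-up partial observations: (cart position, pole angle)
stack = ObservationStack(num_frames=4)
stack.reset((0.0, 0.0))
stack.push((0.01, -0.002))
stack.push((0.03, -0.006))

cart_vel, pole_vel = estimate_velocities(stack.stacked())
print(f"Estimated cart velocity: {cart_vel:.2f}, pole angular velocity: {pole_vel:.2f}")
```

A single observation gives no such estimate, which matches the observation that the agent with only one partial observation hardly learns.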

Stacking Observations Using RLlib’s Trajectory View API

Above, I used Gym’s FrameStack wrapper to stack the last \(n\) observations inside the environment. Alternatively, the stacking can be implemented on the model side, e.g., using RLlib’s trajectory view API. This reduces the memory needed for storing the stacked observations, since each observation is stored only once rather than once per stack it appears in, but should lead to similar results.
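The memory argument can be illustrated with a small, library-independent sketch. It does not use RLlib's actual API: `stack_in_env` and `stack_on_demand` are hypothetical functions, and padding the start of the episode with the first observation is an assumption of this sketch. The first function mimics stacking inside the environment, where every time step stores its own copy of the last \(n\) observations. The second builds the same window on demand from a trajectory that stores each observation once.

```python
def stack_in_env(trajectory, num_frames):
    """Environment-side stacking: one stored window of num_frames observations per step."""
    padded = [trajectory[0]] * (num_frames - 1) + list(trajectory)
    return [padded[t:t + num_frames] for t in range(len(trajectory))]


def stack_on_demand(trajectory, t, num_frames):
    """Model-side stacking: build the window for step t from the stored trajectory."""
    # Indices before the start of the episode are clamped to the first observation
    return [trajectory[max(i, 0)] for i in range(t - num_frames + 1, t + 1)]


# Made-up partial observations: (cart position, pole angle)
trajectory = [(0.01 * t, -0.002 * t) for t in range(5)]
num_frames = 4

env_side = stack_in_env(trajectory, num_frames)

# Both approaches feed the same windows to the model ...
same_windows = all(
    stack_on_demand(trajectory, t, num_frames) == env_side[t]
    for t in range(len(trajectory))
)

# ... but they differ in how many scalar values have to be stored
env_side_values = sum(len(obs) for window in env_side for obs in window)
model_side_values = sum(len(obs) for obs in trajectory)
print(same_windows, env_side_values, model_side_values)
```

In this toy example, stacking inside the environment stores `num_frames` times as many values as the single trajectory, while the model sees identical inputs.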

#collapse-output

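from ray.rllib.examples.models.trajectory_view_utilizing_models import FrameStackingCartPoleModel
from ray.rllib.models.catalog import ModelCatalog

# Register RLlib's example model, which stacks recent time steps on the model side
ModelCatalog.register_custom_model("stacking_model", FrameStackingCartPoleModel)

# Same PPO setup as before (ppo, NUM_FRAMES, and stop come from earlier cells),
# but using the unstacked, partially observable environment
config3a2 = ppo.DEFAULT_CONFIG.copy()
config3a2["env"] = "StatelessCartPole"
config3a2["model"] = {
    "custom_model": "stacking_model",
    # Passed as keyword arguments to the custom model's constructor
    "custom_model_config": {
        "num_frames": NUM_FRAMES,
    }
}

results3a2 = ray.tune.run("PPO", config=config3a2, stop=stop)
print("Option 3a2 with Trajectory API: Training finished successfully")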

== Status ==
Current time: 2021-12-01 23:11:27 (running for 00:00:00.14)
Memory usage on this node: 9.4/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 PENDING)

Trial name	status	loc
PPO_StatelessCartPole_a1402_00000	PENDING

(pid=None) c:\users\stefan\git-repos\private\blog\venv\lib\site-packages\redis\connection.py:77: UserWarning: redis-py works best with hiredis. Please consider installing
(pid=None)   warnings.warn(msg)
(pid=7032) 2021-12-01 23:11:41,672  INFO trainer.py:753 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
(pid=7032) 2021-12-01 23:11:41,672  INFO ppo.py:166 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(pid=7032) 2021-12-01 23:11:41,672  INFO trainer.py:770 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=20056) 2021-12-01 23:11:58,655 WARNING deprecation.py:38 -- DeprecationWarning: `SampleBatch['is_training']` has been deprecated. Use `SampleBatch.is_training` instead. This will raise an error in the future!
(pid=7032) 2021-12-01 23:12:00,488  WARNING deprecation.py:38 -- DeprecationWarning: `SampleBatch['is_training']` has been deprecated. Use `SampleBatch.is_training` instead. This will raise an error in the future!
(pid=7032) 2021-12-01 23:12:01,689  WARNING trainer_template.py:185 -- `execution_plan` functions should accept `trainer`, `workers`, and `config` as args!
(pid=7032) 2021-12-01 23:12:01,689  INFO trainable.py:110 -- Trainable.setup took 20.017 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(pid=7032) 2021-12-01 23:12:01,689  WARNING util.py:57 -- Install gputil for GPU system monitoring.
(pid=7032) 2021-12-01 23:12:08,389  WARNING deprecation.py:38 -- DeprecationWarning: `slice` has been deprecated. Use `SampleBatch[start:stop]` instead. This will raise an error in the future!
(pid=7032) Windows fatal exception: access violation
(pid=7032) 
(pid=20056) [2021-12-01 23:13:35,836 C 20056 17856] core_worker.cc:796:  Check failed: _s.ok() Bad status: IOError: Unknown error
(pid=20056) *** StackTrace Information ***
(pid=20056)     PyInit__raylet
(pid=20056)     PyInit__raylet
(pid=20056)     PyInit__raylet
(pid=20056)     PyInit__raylet
(pid=20056)     PyInit__raylet
(pid=20056)     PyInit__raylet
(pid=20056)     PyInit__raylet
(pid=20056)     PyInit__raylet
(pid=20056)     PyInit__raylet
(pid=20056)     PyInit__raylet
(pid=20056)     PyInit__raylet
(pid=20056)     PyInit__raylet
(pid=20056)     PyInit__raylet
(pid=20056)     PyInit__raylet
(pid=20056)     PyInit__raylet
(pid=20056)     PyInit__raylet
(pid=20056)     PyNumber_InPlaceLshift
(pid=20056)     Py_CheckFunctionResult
(pid=20056)     PyEval_EvalFrameDefault
(pid=20056)     Py_CheckFunctionResult
(pid=20056)     PyEval_EvalFrameDefault
(pid=20056)     PyEval_EvalCodeWithName
(pid=20056)     PyEval_EvalCodeEx
(pid=20056)     PyEval_EvalCode
(pid=20056)     PyArena_New
(pid=20056)     PyArena_New
(pid=20056)     PyRun_FileExFlags
(pid=20056)     PyRun_SimpleFileExFlags
(pid=20056)     PyRun_AnyFileExFlags
(pid=20056)     Py_FatalError
(pid=20056)     Py_RunMain
(pid=20056)     Py_RunMain
(pid=20056)     Py_Main
(pid=20056)     BaseThreadInitThunk
(pid=20056)     RtlUserThreadStart
(pid=20056) 
(pid=20056) Windows fatal exception: access violation
(pid=20056) 
(pid=20056) Stack (most recent call first):
(pid=20056)   File "c:\users\stefan\git-repos\private\blog\venv\lib\site-packages\ray\worker.py", line 425 in main_loop
(pid=20056)   File "c:\users\stefan\git-repos\private\blog\venv\lib\site-packages\ray\workers/default_worker.py", line 218 in <module>
(pid=6936) [2021-12-01 23:13:35,836 C 6936 7600] core_worker.cc:796:  Check failed: _s.ok() Bad status: IOError: Unknown error
(pid=6936) *** StackTrace Information ***
(pid=6936)     PyInit__raylet
(pid=6936)     PyInit__raylet
(pid=6936)     PyInit__raylet
(pid=6936)     PyInit__raylet
(pid=6936)     PyInit__raylet
(pid=6936)     PyInit__raylet
(pid=6936)     PyInit__raylet
(pid=6936)     PyInit__raylet
(pid=6936)     PyInit__raylet
(pid=6936)     PyInit__raylet
(pid=6936)     PyInit__raylet
(pid=6936)     PyInit__raylet
(pid=6936)     PyInit__raylet
(pid=6936)     PyInit__raylet
(pid=6936)     PyInit__raylet
(pid=6936)     PyInit__raylet
(pid=6936)     PyNumber_InPlaceLshift
(pid=6936)     Py_CheckFunctionResult
(pid=6936)     PyEval_EvalFrameDefault
(pid=6936)     Py_CheckFunctionResult
(pid=6936)     PyEval_EvalFrameDefault
(pid=6936)     PyEval_EvalCodeWithName
(pid=6936)     PyEval_EvalCodeEx
(pid=6936)     PyEval_EvalCode
(pid=6936)     PyArena_New
(pid=6936)     PyArena_New
(pid=6936)     PyRun_FileExFlags
(pid=6936)     PyRun_SimpleFileExFlags
(pid=6936)     PyRun_AnyFileExFlags
(pid=6936)     Py_FatalError
(pid=6936)     Py_RunMain
(pid=6936)     Py_RunMain
(pid=6936)     Py_Main
(pid=6936)     BaseThreadInitThunk
(pid=6936)     RtlUserThreadStart
(pid=6936) 
(pid=6936) Windows fatal exception: access violation
(pid=6936) 
(pid=6936) Stack (most recent call first):
(pid=6936)   File "c:\users\stefan\git-repos\private\blog\venv\lib\site-packages\ray\worker.py", line 425 in main_loop
(pid=6936)   File "c:\users\stefan\git-repos\private\blog\venv\lib\site-packages\ray\workers/default_worker.py", line 218 in <module>
2021-12-01 23:13:35,937 INFO tune.py:630 -- Total run time: 128.19 seconds (127.68 seconds for the tuning loop).

== Status ==
Current time: 2021-12-01 23:11:32 (running for 00:00:05.14)
Memory usage on this node: 9.5/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 PENDING)

Trial name	status	loc
PPO_StatelessCartPole_a1402_00000	PENDING

== Status ==
Current time: 2021-12-01 23:12:01 (running for 00:00:33.99)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StatelessCartPole_a1402_00000	RUNNING	127.0.0.1:7032

== Status ==
Current time: 2021-12-01 23:12:02 (running for 00:00:35.21)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StatelessCartPole_a1402_00000	RUNNING	127.0.0.1:7032

== Status ==
Current time: 2021-12-01 23:12:08 (running for 00:00:40.28)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StatelessCartPole_a1402_00000	RUNNING	127.0.0.1:7032

Result for PPO_StatelessCartPole_a1402_00000:
  agent_timesteps_total: 4000
  custom_metrics: {}
  date: 2021-12-01_23-12-12
  done: false
  episode_len_mean: 22.420454545454547
  episode_media: {}
  episode_reward_max: 76.0
  episode_reward_mean: 22.420454545454547
  episode_reward_min: 9.0
  episodes_this_iter: 176
  episodes_total: 176
  experiment_id: a99c739e101a4c88ba77c4f9b0d64803
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6806640028953552
          entropy_coeff: 0.0
          kl: 0.013389287516474724
          model: {}
          policy_loss: -0.021481554955244064
          total_loss: 188.75352478027344
          vf_explained_var: -0.03809177502989769
          vf_loss: 188.77232360839844
    num_agent_steps_sampled: 4000
    num_agent_steps_trained: 4000
    num_steps_sampled: 4000
    num_steps_trained: 4000
  iterations_since_restore: 1
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 80.06666666666668
    ram_util_percent: 86.22666666666666
  pid: 7032
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.12501936271684463
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.18659985231728216
    mean_inference_ms: 2.6111630884934134
    mean_raw_obs_processing_ms: 0.3001049366084402
  time_since_restore: 10.872305870056152
  time_this_iter_s: 10.872305870056152
  time_total_s: 10.872305870056152
  timers:
    learn_throughput: 952.767
    learn_time_ms: 4198.298
    load_throughput: 0.0
    load_time_ms: 0.0
    sample_throughput: 597.657
    sample_time_ms: 6692.801
    update_time_ms: 0.0
  timestamp: 1638396732
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 4000
  training_iteration: 1
  trial_id: a1402_00000
  
Result for PPO_StatelessCartPole_a1402_00000:
  agent_timesteps_total: 8000
  custom_metrics: {}
  date: 2021-12-01_23-12-21
  done: false
  episode_len_mean: 27.07482993197279
  episode_media: {}
  episode_reward_max: 92.0
  episode_reward_mean: 27.07482993197279
  episode_reward_min: 9.0
  episodes_this_iter: 147
  episodes_total: 323
  experiment_id: a99c739e101a4c88ba77c4f9b0d64803
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6690337657928467
          entropy_coeff: 0.0
          kl: 0.006590469740331173
          model: {}
          policy_loss: -0.005931881722062826
          total_loss: 152.8258056640625
          vf_explained_var: -0.11891558021306992
          vf_loss: 152.83041381835938
    num_agent_steps_sampled: 8000
    num_agent_steps_trained: 8000
    num_steps_sampled: 8000
    num_steps_trained: 8000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 2
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 76.04615384615386
    ram_util_percent: 86.37692307692308
  pid: 7032
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.11535685211184185
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.17986845152071707
    mean_inference_ms: 2.358616568210475
    mean_raw_obs_processing_ms: 0.24868940552548166
  time_since_restore: 19.69421625137329
  time_this_iter_s: 8.821910381317139
  time_total_s: 19.69421625137329
  timers:
    learn_throughput: 1030.618
    learn_time_ms: 3881.168
    load_throughput: 0.0
    load_time_ms: 0.0
    sample_throughput: 493.345
    sample_time_ms: 8107.914
    update_time_ms: 0.0
  timestamp: 1638396741
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 8000
  training_iteration: 2
  trial_id: a1402_00000
  
Result for PPO_StatelessCartPole_a1402_00000:
  agent_timesteps_total: 12000
  custom_metrics: {}
  date: 2021-12-01_23-12-29
  done: false
  episode_len_mean: 31.515625
  episode_media: {}
  episode_reward_max: 107.0
  episode_reward_mean: 31.515625
  episode_reward_min: 9.0
  episodes_this_iter: 128
  episodes_total: 451
  experiment_id: a99c739e101a4c88ba77c4f9b0d64803
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6504582166671753
          entropy_coeff: 0.0
          kl: 0.01009401399642229
          model: {}
          policy_loss: -0.014275692403316498
          total_loss: 183.70419311523438
          vf_explained_var: -0.09810103476047516
          vf_loss: 183.71646118164062
    num_agent_steps_sampled: 12000
    num_agent_steps_trained: 12000
    num_steps_sampled: 12000
    num_steps_trained: 12000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 3
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 74.85000000000001
    ram_util_percent: 86.45
  pid: 7032
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.11715046978259044
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.16287547160468283
    mean_inference_ms: 2.2526623070031446
    mean_raw_obs_processing_ms: 0.22798632078370612
  time_since_restore: 27.94515609741211
  time_this_iter_s: 8.250939846038818
  time_total_s: 27.94515609741211
  timers:
    learn_throughput: 1097.145
    learn_time_ms: 3645.825
    load_throughput: 0.0
    load_time_ms: 0.0
    sample_throughput: 482.347
    sample_time_ms: 8292.781
    update_time_ms: 1.334
  timestamp: 1638396749
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 12000
  training_iteration: 3
  trial_id: a1402_00000
  
Result for PPO_StatelessCartPole_a1402_00000:
  agent_timesteps_total: 16000
  custom_metrics: {}
  date: 2021-12-01_23-12-39
  done: false
  episode_len_mean: 41.78
  episode_media: {}
  episode_reward_max: 114.0
  episode_reward_mean: 41.78
  episode_reward_min: 10.0
  episodes_this_iter: 94
  episodes_total: 545
  experiment_id: a99c739e101a4c88ba77c4f9b0d64803
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6398107409477234
          entropy_coeff: 0.0
          kl: 0.006905578076839447
          model: {}
          policy_loss: -9.581594349583611e-05
          total_loss: 233.55148315429688
          vf_explained_var: -0.05927522853016853
          vf_loss: 233.55018615722656
    num_agent_steps_sampled: 16000
    num_agent_steps_trained: 16000
    num_steps_sampled: 16000
    num_steps_trained: 16000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 4
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 80.42307692307692
    ram_util_percent: 86.43846153846154
  pid: 7032
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.11285530769795944
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.15454845909642295
    mean_inference_ms: 2.234221561962398
    mean_raw_obs_processing_ms: 0.23170078440755937
  time_since_restore: 37.353280544281006
  time_this_iter_s: 9.408124446868896
  time_total_s: 37.353280544281006
  timers:
    learn_throughput: 1071.336
    learn_time_ms: 3733.658
    load_throughput: 15993532.888
    load_time_ms: 0.25
    sample_throughput: 477.101
    sample_time_ms: 8383.978
    update_time_ms: 2.002
  timestamp: 1638396759
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 16000
  training_iteration: 4
  trial_id: a1402_00000
  
Result for PPO_StatelessCartPole_a1402_00000:
  agent_timesteps_total: 20000
  custom_metrics: {}
  date: 2021-12-01_23-12-48
  done: false
  episode_len_mean: 45.25
  episode_media: {}
  episode_reward_max: 112.0
  episode_reward_mean: 45.25
  episode_reward_min: 11.0
  episodes_this_iter: 90
  episodes_total: 635
  experiment_id: a99c739e101a4c88ba77c4f9b0d64803
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6046473383903503
          entropy_coeff: 0.0
          kl: 0.00994083285331726
          model: {}
          policy_loss: -0.013053015805780888
          total_loss: 248.39173889160156
          vf_explained_var: -0.07114432007074356
          vf_loss: 248.4027862548828
    num_agent_steps_sampled: 20000
    num_agent_steps_trained: 20000
    num_steps_sampled: 20000
    num_steps_trained: 20000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 5
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 82.4
    ram_util_percent: 86.02307692307691
  pid: 7032
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.1164652225945224
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.1496187880432378
    mean_inference_ms: 2.2751065340130405
    mean_raw_obs_processing_ms: 0.2258053944148459
  time_since_restore: 46.962218284606934
  time_this_iter_s: 9.608937740325928
  time_total_s: 46.962218284606934
  timers:
    learn_throughput: 1078.068
    learn_time_ms: 3710.34
    load_throughput: 19991916.111
    load_time_ms: 0.2
    sample_throughput: 459.08
    sample_time_ms: 8713.071
    update_time_ms: 4.935
  timestamp: 1638396768
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 20000
  training_iteration: 5
  trial_id: a1402_00000
  
Result for PPO_StatelessCartPole_a1402_00000:
  agent_timesteps_total: 24000
  custom_metrics: {}
  date: 2021-12-01_23-12-58
  done: false
  episode_len_mean: 51.02
  episode_media: {}
  episode_reward_max: 167.0
  episode_reward_mean: 51.02
  episode_reward_min: 14.0
  episodes_this_iter: 74
  episodes_total: 709
  experiment_id: a99c739e101a4c88ba77c4f9b0d64803
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.5902564525604248
          entropy_coeff: 0.0
          kl: 0.009673806838691235
          model: {}
          policy_loss: -0.0029914507176727057
          total_loss: 325.0210266113281
          vf_explained_var: -0.020246472209692
          vf_loss: 325.0221252441406
    num_agent_steps_sampled: 24000
    num_agent_steps_trained: 24000
    num_steps_sampled: 24000
    num_steps_trained: 24000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 6
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 80.94615384615385
    ram_util_percent: 85.85384615384616
  pid: 7032
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.12087068704524867
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.15143744867276646
    mean_inference_ms: 2.305419644492814
    mean_raw_obs_processing_ms: 0.22585836913700305
  time_since_restore: 56.67875838279724
  time_this_iter_s: 9.716540098190308
  time_total_s: 56.67875838279724
  timers:
    learn_throughput: 1089.778
    learn_time_ms: 3670.472
    load_throughput: 23990299.333
    load_time_ms: 0.167
    sample_throughput: 448.858
    sample_time_ms: 8911.498
    update_time_ms: 4.112
  timestamp: 1638396778
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 24000
  training_iteration: 6
  trial_id: a1402_00000
  
Result for PPO_StatelessCartPole_a1402_00000:
  agent_timesteps_total: 28000
  custom_metrics: {}
  date: 2021-12-01_23-13-07
  done: false
  episode_len_mean: 58.96
  episode_media: {}
  episode_reward_max: 173.0
  episode_reward_mean: 58.96
  episode_reward_min: 12.0
  episodes_this_iter: 64
  episodes_total: 773
  experiment_id: a99c739e101a4c88ba77c4f9b0d64803
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.5728155970573425
          entropy_coeff: 0.0
          kl: 0.008897491730749607
          model: {}
          policy_loss: -0.011549570597708225
          total_loss: 301.4852294921875
          vf_explained_var: -0.004416638985276222
          vf_loss: 301.4949951171875
    num_agent_steps_sampled: 28000
    num_agent_steps_trained: 28000
    num_steps_sampled: 28000
    num_steps_trained: 28000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 7
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 81.25384615384615
    ram_util_percent: 86.0
  pid: 7032
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.12220176410428243
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.14921622016148267
    mean_inference_ms: 2.304350481426518
    mean_raw_obs_processing_ms: 0.22517177176600833
  time_since_restore: 65.64418339729309
  time_this_iter_s: 8.96542501449585
  time_total_s: 65.64418339729309
  timers:
    learn_throughput: 1099.932
    learn_time_ms: 3636.59
    load_throughput: 27988682.555
    load_time_ms: 0.143
    sample_throughput: 448.012
    sample_time_ms: 8928.331
    update_time_ms: 3.525
  timestamp: 1638396787
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 28000
  training_iteration: 7
  trial_id: a1402_00000
  
Result for PPO_StatelessCartPole_a1402_00000:
  agent_timesteps_total: 32000
  custom_metrics: {}
  date: 2021-12-01_23-13-16
  done: false
  episode_len_mean: 65.85
  episode_media: {}
  episode_reward_max: 173.0
  episode_reward_mean: 65.85
  episode_reward_min: 12.0
  episodes_this_iter: 57
  episodes_total: 830
  experiment_id: a99c739e101a4c88ba77c4f9b0d64803
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.5591020584106445
          entropy_coeff: 0.0
          kl: 0.010199657641351223
          model: {}
          policy_loss: -0.0008332685683853924
          total_loss: 395.8393859863281
          vf_explained_var: 0.02246681973338127
          vf_loss: 395.8381652832031
    num_agent_steps_sampled: 32000
    num_agent_steps_trained: 32000
    num_steps_sampled: 32000
    num_steps_trained: 32000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 8
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 74.975
    ram_util_percent: 86.25000000000001
  pid: 7032
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.11971547914161862
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.1491953792440581
    mean_inference_ms: 2.288866379945899
    mean_raw_obs_processing_ms: 0.21941612443688768
  time_since_restore: 74.37759304046631
  time_this_iter_s: 8.733409643173218
  time_total_s: 74.37759304046631
  timers:
    learn_throughput: 1104.214
    learn_time_ms: 3622.488
    load_throughput: 31987065.777
    load_time_ms: 0.125
    sample_throughput: 449.229
    sample_time_ms: 8904.153
    update_time_ms: 3.084
  timestamp: 1638396796
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 32000
  training_iteration: 8
  trial_id: a1402_00000
  
Result for PPO_StatelessCartPole_a1402_00000:
  agent_timesteps_total: 36000
  custom_metrics: {}
  date: 2021-12-01_23-13-26
  done: false
  episode_len_mean: 75.45
  episode_media: {}
  episode_reward_max: 294.0
  episode_reward_mean: 75.45
  episode_reward_min: 15.0
  episodes_this_iter: 49
  episodes_total: 879
  experiment_id: a99c739e101a4c88ba77c4f9b0d64803
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.5561075806617737
          entropy_coeff: 0.0
          kl: 0.010793409310281277
          model: {}
          policy_loss: -0.006749303545802832
          total_loss: 487.8432922363281
          vf_explained_var: 0.037814658135175705
          vf_loss: 487.8478698730469
    num_agent_steps_sampled: 36000
    num_agent_steps_trained: 36000
    num_steps_sampled: 36000
    num_steps_trained: 36000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 9
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 87.64285714285714
    ram_util_percent: 86.90714285714286
  pid: 7032
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.12074714910315451
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.15026008834474422
    mean_inference_ms: 2.3119485993450812
    mean_raw_obs_processing_ms: 0.21632637276503874
  time_since_restore: 84.76655960083008
  time_this_iter_s: 10.38896656036377
  time_total_s: 84.76655960083008
  timers:
    learn_throughput: 1111.045
    learn_time_ms: 3600.213
    load_throughput: 35985448.999
    load_time_ms: 0.111
    sample_throughput: 440.37
    sample_time_ms: 9083.268
    update_time_ms: 2.964
  timestamp: 1638396806
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 36000
  training_iteration: 9
  trial_id: a1402_00000
  
Result for PPO_StatelessCartPole_a1402_00000:
  agent_timesteps_total: 40000
  custom_metrics: {}
  date: 2021-12-01_23-13-35
  done: true
  episode_len_mean: 84.4
  episode_media: {}
  episode_reward_max: 294.0
  episode_reward_mean: 84.4
  episode_reward_min: 17.0
  episodes_this_iter: 40
  episodes_total: 919
  experiment_id: a99c739e101a4c88ba77c4f9b0d64803
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.5557733774185181
          entropy_coeff: 0.0
          kl: 0.005785451736301184
          model: {}
          policy_loss: -0.0009503072360530496
          total_loss: 503.70172119140625
          vf_explained_var: 0.0725829154253006
          vf_loss: 503.7015380859375
    num_agent_steps_sampled: 40000
    num_agent_steps_trained: 40000
    num_steps_sampled: 40000
    num_steps_trained: 40000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 10
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 79.30833333333332
    ram_util_percent: 87.125
  pid: 7032
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.12150490867421336
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.14914803016875297
    mean_inference_ms: 2.327153258190586
    mean_raw_obs_processing_ms: 0.21595014888922642
  time_since_restore: 93.20761466026306
  time_this_iter_s: 8.441055059432983
  time_total_s: 93.20761466026306
  timers:
    learn_throughput: 1120.545
    learn_time_ms: 3569.692
    load_throughput: 20046858.645
    load_time_ms: 0.2
    sample_throughput: 442.479
    sample_time_ms: 9039.968
    update_time_ms: 2.868
  timestamp: 1638396815
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 40000
  training_iteration: 10
  trial_id: a1402_00000
  
Option 3a2 with Trajectory API: Training finished successfully

== Status ==
Current time: 2021-12-01 23:12:13 (running for 00:00:45.92)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_a1402_00000	RUNNING	127.0.0.1:7032	1	10.8723	4000	22.4205	76	9	22.4205

== Status ==
Current time: 2021-12-01 23:12:18 (running for 00:00:51.00)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_a1402_00000	RUNNING	127.0.0.1:7032	1	10.8723	4000	22.4205	76	9	22.4205

== Status ==
Current time: 2021-12-01 23:12:24 (running for 00:00:56.98)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_a1402_00000	RUNNING	127.0.0.1:7032	2	19.6942	8000	27.0748	92	9	27.0748

== Status ==
Current time: 2021-12-01 23:12:29 (running for 00:01:02.06)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_a1402_00000	RUNNING	127.0.0.1:7032	3	27.9452	12000	31.5156	107	9	31.5156

== Status ==
Current time: 2021-12-01 23:12:34 (running for 00:01:07.12)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_a1402_00000	RUNNING	127.0.0.1:7032	3	27.9452	12000	31.5156	107	9	31.5156

== Status ==
Current time: 2021-12-01 23:12:40 (running for 00:01:12.53)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_a1402_00000	RUNNING	127.0.0.1:7032	4	37.3533	16000	41.78	114	10	41.78

== Status ==
Current time: 2021-12-01 23:12:45 (running for 00:01:17.88)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_a1402_00000	RUNNING	127.0.0.1:7032	4	37.3533	16000	41.78	114	10	41.78

== Status ==
Current time: 2021-12-01 23:12:51 (running for 00:01:23.24)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_a1402_00000	RUNNING	127.0.0.1:7032	5	46.9622	20000	45.25	112	11	45.25

== Status ==
Current time: 2021-12-01 23:12:56 (running for 00:01:28.34)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_a1402_00000	RUNNING	127.0.0.1:7032	5	46.9622	20000	45.25	112	11	45.25

== Status ==
Current time: 2021-12-01 23:13:01 (running for 00:01:33.98)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_a1402_00000	RUNNING	127.0.0.1:7032	6	56.6788	24000	51.02	167	14	51.02

== Status ==
Current time: 2021-12-01 23:13:06 (running for 00:01:39.08)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_a1402_00000	RUNNING	127.0.0.1:7032	6	56.6788	24000	51.02	167	14	51.02

== Status ==
Current time: 2021-12-01 23:13:12 (running for 00:01:45.03)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_a1402_00000	RUNNING	127.0.0.1:7032	7	65.6442	28000	58.96	173	12	58.96

== Status ==
Current time: 2021-12-01 23:13:17 (running for 00:01:50.20)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_a1402_00000	RUNNING	127.0.0.1:7032	8	74.3776	32000	65.85	173	12	65.85

== Status ==
Current time: 2021-12-01 23:13:23 (running for 00:01:55.26)
Memory usage on this node: 10.4/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_a1402_00000	RUNNING	127.0.0.1:7032	8	74.3776	32000	65.85	173	12	65.85

== Status ==
Current time: 2021-12-01 23:13:29 (running for 00:02:01.24)
Memory usage on this node: 10.4/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_a1402_00000	RUNNING	127.0.0.1:7032	9	84.7666	36000	75.45	294	15	75.45

== Status ==
Current time: 2021-12-01 23:13:34 (running for 00:02:06.30)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_a1402_00000	RUNNING	127.0.0.1:7032	9	84.7666	36000	75.45	294	15	75.45

== Status ==
Current time: 2021-12-01 23:13:35 (running for 00:02:07.72)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 TERMINATED)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_a1402_00000	TERMINATED	127.0.0.1:7032	10	93.2076	40000	84.4	294	17	84.4

print_reward(results3a2)

Reward after 10 training iterations: 84.4

plot_rewards(results3a2)

# Compare both variants: does stacking observations inside the model work worse than stacking them in the environment?
plot_learning(results3a, label="3a: Stacked Obs in Env")
plot_learning(results3a2, label="3a2: Stacked Obs in Model")

Option 3b: Use an LSTM for Processing the Sequence

Instead of stacking the last \(n\) observations and feeding this sequence into a regular feed-forward neural network, a recurrent neural network (RNN) can be used. An RNN maintains a learned hidden state that is carried forward from one observation to the next.
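To make the idea of a carried-forward state more concrete, here is a minimal sketch of a plain (Elman-style) recurrent cell in pure Python. It is not RLlib's LSTM implementation: the function names rnn_step and run_sequence, the hidden size, and the random, untrained weights are all assumptions chosen for illustration. The two-value observations (cart position, pole angle) are likewise assumed to mimic the stateless CartPole snapshot. The point is only that two histories ending in the same snapshot leave the cell in different hidden states, which is the information a policy needs to infer the missing velocities.

```python
import math
import random


def rnn_step(obs, hidden, w_in, w_rec, bias):
    """One step of a plain recurrent cell: h' = tanh(W_in * obs + W_rec * h + b)."""
    new_hidden = []
    for j in range(len(hidden)):
        total = bias[j]
        total += sum(w_in[j][i] * obs[i] for i in range(len(obs)))
        total += sum(w_rec[j][k] * hidden[k] for k in range(len(hidden)))
        new_hidden.append(math.tanh(total))
    return new_hidden


def run_sequence(observations, hidden_size=3, seed=0):
    """Feed a sequence of observations through the cell, one step at a time."""
    rng = random.Random(seed)  # fixed seed: same (untrained) weights on every call
    obs_size = len(observations[0])
    w_in = [[rng.uniform(-1, 1) for _ in range(obs_size)] for _ in range(hidden_size)]
    w_rec = [[rng.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(hidden_size)]
    bias = [0.0] * hidden_size
    hidden = [0.0] * hidden_size  # initial state at the start of an episode
    for obs in observations:
        # The hidden state is the only thing passed from one observation to the next.
        hidden = rnn_step(obs, hidden, w_in, w_rec, bias)
    return hidden


# Two hypothetical histories of (cart position, pole angle) snapshots.
# Both end in the same snapshot, but the pole is moving in opposite directions.
swinging_right = [[0.0, -0.02], [0.0, 0.00], [0.0, 0.02]]
swinging_left = [[0.0, 0.06], [0.0, 0.04], [0.0, 0.02]]

h_right = run_sequence(swinging_right)
h_left = run_sequence(swinging_left)

print("hidden state (swinging right):", h_right)
print("hidden state (swinging left): ", h_left)
print("states differ:", h_right != h_left)
```

In a trained agent, the weights would be learned so that the hidden state encodes something useful, such as an estimate of the pole's angular velocity. An LSTM adds gating on top of this basic recurrence but follows the same pattern of passing state from step to step.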

Long short-term memory (LSTM) networks are a variant of RNNs designed to retain state over longer sequences. To use an LSTM with RLlib, set the use_lstm flag in the model config:

#collapse-output

config3b = ppo.DEFAULT_CONFIG.copy()
config3b["env"] = "StatelessCartPole"
config3b["model"] = {
    "use_lstm": True,
    # "max_seq_len": 10,
}

results3b = ray.tune.run("PPO", config=config3b, stop=stop)
print("Option 3b: Training finished successfully")

== Status ==
Current time: 2021-12-01 23:14:43 (running for 00:00:00.14)
Memory usage on this node: 9.6/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 PENDING)

Trial name	status	loc
PPO_StatelessCartPole_15f17_00000	PENDING

(pid=None) c:\users\stefan\git-repos\private\blog\venv\lib\site-packages\redis\connection.py:77: UserWarning: redis-py works best with hiredis. Please consider installing
(pid=None)   warnings.warn(msg)
(pid=14344) 2021-12-01 23:14:58,972 INFO trainer.py:753 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
(pid=14344) 2021-12-01 23:14:58,972 INFO ppo.py:166 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(pid=14344) 2021-12-01 23:14:58,972 INFO trainer.py:770 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=8656) 2021-12-01 23:15:15,322  WARNING deprecation.py:38 -- DeprecationWarning: `SampleBatch['is_training']` has been deprecated. Use `SampleBatch.is_training` instead. This will raise an error in the future!
(pid=14344) 2021-12-01 23:15:20,461 WARNING deprecation.py:38 -- DeprecationWarning: `SampleBatch['is_training']` has been deprecated. Use `SampleBatch.is_training` instead. This will raise an error in the future!
(pid=14344) 2021-12-01 23:15:23,593 WARNING trainer_template.py:185 -- `execution_plan` functions should accept `trainer`, `workers`, and `config` as args!
(pid=14344) 2021-12-01 23:15:23,593 INFO trainable.py:110 -- Trainable.setup took 24.634 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(pid=14344) 2021-12-01 23:15:23,593 WARNING util.py:57 -- Install gputil for GPU system monitoring.
(pid=14344) 2021-12-01 23:15:32,644 WARNING deprecation.py:38 -- DeprecationWarning: `slice` has been deprecated. Use `SampleBatch[start:stop]` instead. This will raise an error in the future!
(pid=14344) Windows fatal exception: access violation
(pid=14344) 
2021-12-01 23:26:57,480 INFO tune.py:630 -- Total run time: 733.96 seconds (733.70 seconds for the tuning loop).

== Status ==
Current time: 2021-12-01 23:14:48 (running for 00:00:05.16)
Memory usage on this node: 9.6/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 PENDING)

Trial name	status	loc
PPO_StatelessCartPole_15f17_00000	PENDING

== Status ==
Current time: 2021-12-01 23:15:23 (running for 00:00:40.06)
Memory usage on this node: 10.4/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344

== Status ==
Current time: 2021-12-01 23:15:25 (running for 00:00:42.11)
Memory usage on this node: 10.4/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344

== Status ==
Current time: 2021-12-01 23:15:30 (running for 00:00:47.44)
Memory usage on this node: 10.5/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344

== Status ==
Current time: 2021-12-01 23:15:36 (running for 00:00:52.62)
Memory usage on this node: 10.5/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344

== Status ==
Current time: 2021-12-01 23:15:41 (running for 00:00:57.90)
Memory usage on this node: 10.5/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344

== Status ==
Current time: 2021-12-01 23:15:46 (running for 00:01:03.24)
Memory usage on this node: 10.5/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344

== Status ==
Current time: 2021-12-01 23:15:52 (running for 00:01:08.55)
Memory usage on this node: 10.5/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344

== Status ==
Current time: 2021-12-01 23:15:57 (running for 00:01:13.93)
Memory usage on this node: 10.5/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344

== Status ==
Current time: 2021-12-01 23:16:02 (running for 00:01:19.04)
Memory usage on this node: 10.5/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344

== Status ==
Current time: 2021-12-01 23:16:07 (running for 00:01:24.29)
Memory usage on this node: 10.5/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344

== Status ==
Current time: 2021-12-01 23:16:12 (running for 00:01:29.41)
Memory usage on this node: 10.4/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344

== Status ==
Current time: 2021-12-01 23:16:18 (running for 00:01:35.06)
Memory usage on this node: 10.4/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344

== Status ==
Current time: 2021-12-01 23:16:23 (running for 00:01:40.21)
Memory usage on this node: 10.4/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344

== Status ==
Current time: 2021-12-01 23:16:29 (running for 00:01:45.76)
Memory usage on this node: 10.5/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344

== Status ==
Current time: 2021-12-01 23:16:34 (running for 00:01:50.86)
Memory usage on this node: 10.4/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344

== Status ==
Current time: 2021-12-01 23:16:39 (running for 00:01:56.00)
Memory usage on this node: 10.4/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344

Result for PPO_StatelessCartPole_15f17_00000:
  agent_timesteps_total: 4000
  custom_metrics: {}
  date: 2021-12-01_23-16-44
  done: false
  episode_len_mean: 24.74534161490683
  episode_media: {}
  episode_reward_max: 73.0
  episode_reward_mean: 24.74534161490683
  episode_reward_min: 9.0
  episodes_this_iter: 161
  episodes_total: 161
  experiment_id: e1f0e5c8896845ff90ba05d38befaef8
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6702085137367249
          entropy_coeff: 0.0
          kl: 0.01620529219508171
          model: {}
          policy_loss: -0.021323969587683678
          total_loss: 147.9649658203125
          vf_explained_var: -0.08384507149457932
          vf_loss: 147.98304748535156
    num_agent_steps_sampled: 4000
    num_agent_steps_trained: 4000
    num_steps_sampled: 4000
    num_steps_trained: 4000
  iterations_since_restore: 1
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 98.09908256880733
    ram_util_percent: 87.94036697247705
  pid: 14344
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.14404039499775298
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.15224504509160317
    mean_inference_ms: 3.842544658680462
    mean_raw_obs_processing_ms: 0.25258780840844064
  time_since_restore: 80.78264856338501
  time_this_iter_s: 80.78264856338501
  time_total_s: 80.78264856338501
  timers:
    learn_throughput: 55.809
    learn_time_ms: 71672.947
    load_throughput: 4001243.978
    load_time_ms: 1.0
    sample_throughput: 442.023
    sample_time_ms: 9049.296
    update_time_ms: 14.003
  timestamp: 1638397004
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 4000
  training_iteration: 1
  trial_id: 15f17_00000
  
Result for PPO_StatelessCartPole_15f17_00000:
  agent_timesteps_total: 8000
  custom_metrics: {}
  date: 2021-12-01_23-17-57
  done: false
  episode_len_mean: 29.848484848484848
  episode_media: {}
  episode_reward_max: 85.0
  episode_reward_mean: 29.848484848484848
  episode_reward_min: 9.0
  episodes_this_iter: 132
  episodes_total: 293
  experiment_id: e1f0e5c8896845ff90ba05d38befaef8
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6473504304885864
          entropy_coeff: 0.0
          kl: 0.010533971711993217
          model: {}
          policy_loss: -0.009972598403692245
          total_loss: 134.57276916503906
          vf_explained_var: 0.15424844622612
          vf_loss: 134.58062744140625
    num_agent_steps_sampled: 8000
    num_agent_steps_trained: 8000
    num_steps_sampled: 8000
    num_steps_trained: 8000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 2
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 98.81443298969072
    ram_util_percent: 88.0278350515464
  pid: 14344
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.1454985114758189
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.1627394677104148
    mean_inference_ms: 3.657513714820443
    mean_raw_obs_processing_ms: 0.2482827895962494
  time_since_restore: 153.95946764945984
  time_this_iter_s: 73.17681908607483
  time_total_s: 153.95946764945984
  timers:
    learn_throughput: 58.598
    learn_time_ms: 68261.841
    load_throughput: 8002487.956
    load_time_ms: 0.5
    sample_throughput: 89.708
    sample_time_ms: 44589.185
    update_time_ms: 10.501
  timestamp: 1638397077
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 8000
  training_iteration: 2
  trial_id: 15f17_00000
  
Result for PPO_StatelessCartPole_15f17_00000:
  agent_timesteps_total: 12000
  custom_metrics: {}
  date: 2021-12-01_23-19-11
  done: false
  episode_len_mean: 32.095238095238095
  episode_media: {}
  episode_reward_max: 91.0
  episode_reward_mean: 32.095238095238095
  episode_reward_min: 9.0
  episodes_this_iter: 126
  episodes_total: 419
  experiment_id: e1f0e5c8896845ff90ba05d38befaef8
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6104030013084412
          entropy_coeff: 0.0
          kl: 0.01123636681586504
          model: {}
          policy_loss: -0.0049535431899130344
          total_loss: 151.1492919921875
          vf_explained_var: 0.1538025140762329
          vf_loss: 151.15199279785156
    num_agent_steps_sampled: 12000
    num_agent_steps_trained: 12000
    num_steps_sampled: 12000
    num_steps_trained: 12000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 3
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 97.37070707070708
    ram_util_percent: 87.73737373737374
  pid: 14344
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.1454022761596221
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.1557348165264045
    mean_inference_ms: 3.294093173663449
    mean_raw_obs_processing_ms: 0.25605837493978906
  time_since_restore: 227.74660396575928
  time_this_iter_s: 73.78713631629944
  time_total_s: 227.74660396575928
  timers:
    learn_throughput: 58.855
    learn_time_ms: 67963.962
    load_throughput: 6009031.519
    load_time_ms: 0.666
    sample_throughput: 74.781
    sample_time_ms: 53489.736
    update_time_ms: 8.973
  timestamp: 1638397151
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 12000
  training_iteration: 3
  trial_id: 15f17_00000
  
Result for PPO_StatelessCartPole_15f17_00000:
  agent_timesteps_total: 16000
  custom_metrics: {}
  date: 2021-12-01_23-20-22
  done: false
  episode_len_mean: 42.16
  episode_media: {}
  episode_reward_max: 97.0
  episode_reward_mean: 42.16
  episode_reward_min: 12.0
  episodes_this_iter: 93
  episodes_total: 512
  experiment_id: e1f0e5c8896845ff90ba05d38befaef8
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6158422827720642
          entropy_coeff: 0.0
          kl: 0.010851425118744373
          model: {}
          policy_loss: -0.003959023393690586
          total_loss: 152.54002380371094
          vf_explained_var: 0.17243239283561707
          vf_loss: 152.5417938232422
    num_agent_steps_sampled: 16000
    num_agent_steps_trained: 16000
    num_steps_sampled: 16000
    num_steps_trained: 16000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 4
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 99.08333333333333
    ram_util_percent: 85.98645833333335
  pid: 14344
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.1462575659080806
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.15407968489488727
    mean_inference_ms: 3.275941401825844
    mean_raw_obs_processing_ms: 0.24603215260913444
  time_since_restore: 298.2684841156006
  time_this_iter_s: 70.52188014984131
  time_total_s: 298.2684841156006
  timers:
    learn_throughput: 59.973
    learn_time_ms: 66696.53
    load_throughput: 8012042.025
    load_time_ms: 0.499
    sample_throughput: 67.917
    sample_time_ms: 58895.107
    update_time_ms: 7.627
  timestamp: 1638397222
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 16000
  training_iteration: 4
  trial_id: 15f17_00000
  
Result for PPO_StatelessCartPole_15f17_00000:
  agent_timesteps_total: 20000
  custom_metrics: {}
  date: 2021-12-01_23-21-45
  done: false
  episode_len_mean: 33.36974789915966
  episode_media: {}
  episode_reward_max: 103.0
  episode_reward_mean: 33.36974789915966
  episode_reward_min: 10.0
  episodes_this_iter: 119
  episodes_total: 631
  experiment_id: e1f0e5c8896845ff90ba05d38befaef8
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6034429669380188
          entropy_coeff: 0.0
          kl: 0.012084417045116425
          model: {}
          policy_loss: -0.012375776655972004
          total_loss: 157.4758758544922
          vf_explained_var: 0.14680173993110657
          vf_loss: 157.48585510253906
    num_agent_steps_sampled: 20000
    num_agent_steps_trained: 20000
    num_steps_sampled: 20000
    num_steps_trained: 20000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 5
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 98.08396226415095
    ram_util_percent: 85.38396226415095
  pid: 14344
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.14943092584832965
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.1572909528662859
    mean_inference_ms: 3.248902887560971
    mean_raw_obs_processing_ms: 0.24150111904383836
  time_since_restore: 381.94007539749146
  time_this_iter_s: 83.67159128189087
  time_total_s: 381.94007539749146
  timers:
    learn_throughput: 58.34
    learn_time_ms: 68563.84
    load_throughput: 10015052.531
    load_time_ms: 0.399
    sample_throughput: 65.33
    sample_time_ms: 61227.393
    update_time_ms: 8.119
  timestamp: 1638397305
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 20000
  training_iteration: 5
  trial_id: 15f17_00000
  
Result for PPO_StatelessCartPole_15f17_00000:
  agent_timesteps_total: 24000
  custom_metrics: {}
  date: 2021-12-01_23-22-56
  done: false
  episode_len_mean: 39.95
  episode_media: {}
  episode_reward_max: 100.0
  episode_reward_mean: 39.95
  episode_reward_min: 9.0
  episodes_this_iter: 100
  episodes_total: 731
  experiment_id: e1f0e5c8896845ff90ba05d38befaef8
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6064294576644897
          entropy_coeff: 0.0
          kl: 0.006135161034762859
          model: {}
          policy_loss: -0.0012852392392233014
          total_loss: 144.01748657226562
          vf_explained_var: 0.2577447295188904
          vf_loss: 144.01754760742188
    num_agent_steps_sampled: 24000
    num_agent_steps_trained: 24000
    num_steps_sampled: 24000
    num_steps_trained: 24000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 6
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 97.5808510638298
    ram_util_percent: 86.03617021276594
  pid: 14344
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.14752250682829854
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.15924224077925142
    mean_inference_ms: 3.1588659969113992
    mean_raw_obs_processing_ms: 0.23714881828949555
  time_since_restore: 452.454083442688
  time_this_iter_s: 70.51400804519653
  time_total_s: 452.454083442688
  timers:
    learn_throughput: 59.002
    learn_time_ms: 67794.396
    load_throughput: 12018063.037
    load_time_ms: 0.333
    sample_throughput: 61.729
    sample_time_ms: 64799.66
    update_time_ms: 7.201
  timestamp: 1638397376
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 24000
  training_iteration: 6
  trial_id: 15f17_00000
  
Result for PPO_StatelessCartPole_15f17_00000:
  agent_timesteps_total: 28000
  custom_metrics: {}
  date: 2021-12-01_23-23-57
  done: false
  episode_len_mean: 34.38461538461539
  episode_media: {}
  episode_reward_max: 93.0
  episode_reward_mean: 34.38461538461539
  episode_reward_min: 9.0
  episodes_this_iter: 117
  episodes_total: 848
  experiment_id: e1f0e5c8896845ff90ba05d38befaef8
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.602990984916687
          entropy_coeff: 0.0
          kl: 0.005702142603695393
          model: {}
          policy_loss: -0.0026899229269474745
          total_loss: 126.622802734375
          vf_explained_var: 0.28406739234924316
          vf_loss: 126.62434387207031
    num_agent_steps_sampled: 28000
    num_agent_steps_trained: 28000
    num_steps_sampled: 28000
    num_steps_trained: 28000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 7
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 97.09285714285713
    ram_util_percent: 85.65119047619051
  pid: 14344
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.14141074845480212
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.1582564423010624
    mean_inference_ms: 3.1048115340137876
    mean_raw_obs_processing_ms: 0.23267272113813148
  time_since_restore: 513.63032746315
  time_this_iter_s: 61.176244020462036
  time_total_s: 513.63032746315
  timers:
    learn_throughput: 60.697
    learn_time_ms: 65900.842
    load_throughput: 14021073.543
    load_time_ms: 0.285
    sample_throughput: 60.942
    sample_time_ms: 65636.225
    update_time_ms: 6.744
  timestamp: 1638397437
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 28000
  training_iteration: 7
  trial_id: 15f17_00000
  
Result for PPO_StatelessCartPole_15f17_00000:
  agent_timesteps_total: 32000
  custom_metrics: {}
  date: 2021-12-01_23-24-57
  done: false
  episode_len_mean: 36.44954128440367
  episode_media: {}
  episode_reward_max: 77.0
  episode_reward_mean: 36.44954128440367
  episode_reward_min: 11.0
  episodes_this_iter: 109
  episodes_total: 957
  experiment_id: e1f0e5c8896845ff90ba05d38befaef8
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.5927915573120117
          entropy_coeff: 0.0
          kl: 0.012266391888260841
          model: {}
          policy_loss: -0.0021681918296962976
          total_loss: 94.81949615478516
          vf_explained_var: 0.31072309613227844
          vf_loss: 94.81922149658203
    num_agent_steps_sampled: 32000
    num_agent_steps_trained: 32000
    num_steps_sampled: 32000
    num_steps_trained: 32000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 8
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 95.18048780487804
    ram_util_percent: 85.51341463414633
  pid: 14344
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.13663044317764386
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.15200395813252837
    mean_inference_ms: 3.023815960844668
    mean_raw_obs_processing_ms: 0.22803359086359726
  time_since_restore: 573.1020631790161
  time_this_iter_s: 59.47173571586609
  time_total_s: 573.1020631790161
  timers:
    learn_throughput: 62.141
    learn_time_ms: 64369.633
    load_throughput: 16024084.05
    load_time_ms: 0.25
    sample_throughput: 61.555
    sample_time_ms: 64982.899
    update_time_ms: 5.901
  timestamp: 1638397497
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 32000
  training_iteration: 8
  trial_id: 15f17_00000
  
Result for PPO_StatelessCartPole_15f17_00000:
  agent_timesteps_total: 36000
  custom_metrics: {}
  date: 2021-12-01_23-25-57
  done: false
  episode_len_mean: 34.80701754385965
  episode_media: {}
  episode_reward_max: 93.0
  episode_reward_mean: 34.80701754385965
  episode_reward_min: 10.0
  episodes_this_iter: 114
  episodes_total: 1071
  experiment_id: e1f0e5c8896845ff90ba05d38befaef8
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.5793452262878418
          entropy_coeff: 0.0
          kl: 0.009524409659206867
          model: {}
          policy_loss: 0.00497779343277216
          total_loss: 101.02494812011719
          vf_explained_var: 0.31699204444885254
          vf_loss: 101.01805877685547
    num_agent_steps_sampled: 36000
    num_agent_steps_trained: 36000
    num_steps_sampled: 36000
    num_steps_trained: 36000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 9
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 95.77349397590359
    ram_util_percent: 85.25903614457832
  pid: 14344
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.13135990601240866
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.14946784585673123
    mean_inference_ms: 2.9620819680365273
    mean_raw_obs_processing_ms: 0.22295928348485725
  time_since_restore: 633.5450580120087
  time_this_iter_s: 60.442994832992554
  time_total_s: 633.5450580120087
  timers:
    learn_throughput: 63.214
    learn_time_ms: 63277.532
    load_throughput: 18027094.556
    load_time_ms: 0.222
    sample_throughput: 62.131
    sample_time_ms: 64380.478
    update_time_ms: 6.023
  timestamp: 1638397557
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 36000
  training_iteration: 9
  trial_id: 15f17_00000
  
Result for PPO_StatelessCartPole_15f17_00000:
  agent_timesteps_total: 40000
  custom_metrics: {}
  date: 2021-12-01_23-26-57
  done: true
  episode_len_mean: 41.01
  episode_media: {}
  episode_reward_max: 91.0
  episode_reward_mean: 41.01
  episode_reward_min: 9.0
  episodes_this_iter: 97
  episodes_total: 1168
  experiment_id: e1f0e5c8896845ff90ba05d38befaef8
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.5690697431564331
          entropy_coeff: 0.0
          kl: 0.005614957306534052
          model: {}
          policy_loss: 0.0013457380700856447
          total_loss: 97.90939331054688
          vf_explained_var: 0.3871138393878937
          vf_loss: 97.90692138671875
    num_agent_steps_sampled: 40000
    num_agent_steps_trained: 40000
    num_steps_sampled: 40000
    num_steps_trained: 40000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 10
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 95.57160493827159
    ram_util_percent: 85.20246913580247
  pid: 14344
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.1271907494503554
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.1465406708441855
    mean_inference_ms: 2.916534058088682
    mean_raw_obs_processing_ms: 0.21554435209484965
  time_since_restore: 693.0942261219025
  time_this_iter_s: 59.5491681098938
  time_total_s: 693.0942261219025
  timers:
    learn_throughput: 64.181
    learn_time_ms: 62324.001
    load_throughput: 20030105.062
    load_time_ms: 0.2
    sample_throughput: 62.515
    sample_time_ms: 63984.333
    update_time_ms: 6.121
  timestamp: 1638397617
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 40000
  training_iteration: 10
  trial_id: 15f17_00000
  
Option 3b: Training finished successfully

== Status ==
Current time: 2021-12-01 23:16:45 (running for 00:02:01.96)
Memory usage on this node: 10.5/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344	1	80.7826	4000	24.7453	73	9	24.7453

== Status ==
Current time: 2021-12-01 23:17:58 (running for 00:03:15.16)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344	2	153.959	8000	29.8485	85	9	29.8485

== Status ==
Current time: 2021-12-01 23:19:12 (running for 00:04:29.01)
Memory usage on this node: 10.6/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344	3	227.747	12000	32.0952	91	9	32.0952

== Status ==
Current time: 2021-12-01 23:20:25 (running for 00:05:41.62)
Memory usage on this node: 10.1/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344	4	298.268	16000	42.16	97	12	42.16

== Status ==
Current time: 2021-12-01 23:21:46 (running for 00:07:03.31)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344	5	381.94	20000	33.3697	103	10	33.3697

== Status ==
Current time: 2021-12-01 23:23:00 (running for 00:08:16.92)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344	6	452.454	24000	39.95	100	9	39.95

== Status ==
Current time: 2021-12-01 23:24:02 (running for 00:09:19.13)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344	7	513.63	28000	34.3846	93	9	34.3846

== Status ==
Current time: 2021-12-01 23:24:38 (running for 00:09:55.34)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344	7	513.63	28000	34.3846	93	9	34.3846

== Status ==
Current time: 2021-12-01 23:24:43 (running for 00:10:00.44)
Memory usage on this node: 10.1/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344	7	513.63	28000	34.3846	93	9	34.3846

== Status ==
Current time: 2021-12-01 23:24:49 (running for 00:10:05.64)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344	7	513.63	28000	34.3846	93	9	34.3846

== Status ==
Current time: 2021-12-01 23:24:54 (running for 00:10:10.74)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344	7	513.63	28000	34.3846	93	9	34.3846

== Status ==
Current time: 2021-12-01 23:25:00 (running for 00:10:16.66)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344	8	573.102	32000	36.4495	77	11	36.4495

== Status ==
Current time: 2021-12-01 23:26:02 (running for 00:11:19.16)
Memory usage on this node: 10.1/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_15f17_00000	RUNNING	127.0.0.1:14344	9	633.545	36000	34.807	93	10	34.807

== Status ==
Current time: 2021-12-01 23:26:57 (running for 00:12:13.80)
Memory usage on this node: 9.9/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 TERMINATED)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_15f17_00000	TERMINATED	127.0.0.1:14344	10	693.094	40000	41.01	91	9	41.01

print_reward(results3b)

Reward after 10 training iterations: 41.01

plot_rewards(results3b)

c:\users\stefan\git-repos\private\blog\venv\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

plot_learning(results1, label="1: Full Observations")
plot_learning(results2, label="2: Partial Observations")
plot_learning(results3a, label="3a: Stacked, Partial Observations")
plot_learning(results3b, label="3b: LSTM")

LSTM with Stacked Observations

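This variant combines options 3a and 3b: the LSTM model is trained on the StackedStatelessCartPole environment from above, so at every step it sees a short window of past observations in addition to its own hidden state.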
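As a reminder of what a stacked environment feeds into the model, here is a minimal, self-contained sketch of observation stacking. It is not the StackedStatelessCartPole implementation from above; the class name `ObservationStacker`, the stack size of 3, and the zero padding at episode start are assumptions made purely for illustration. Each observation is assumed to hold only the two values that remain visible in the stateless setting (cart position and pole angle).

```python
from collections import deque


class ObservationStacker:
    """Hypothetical helper: keeps the last `num_stack` observations as one flat list."""

    def __init__(self, num_stack, obs_size):
        self.num_stack = num_stack
        self.obs_size = obs_size
        # A bounded deque drops the oldest observation automatically.
        self.frames = deque(maxlen=num_stack)

    def reset(self, first_obs):
        # Pad the history with zeros so the stacked size is constant from the first step.
        self.frames.clear()
        for _ in range(self.num_stack - 1):
            self.frames.append([0.0] * self.obs_size)
        self.frames.append(list(first_obs))
        return self.stacked()

    def step(self, obs):
        self.frames.append(list(obs))
        return self.stacked()

    def stacked(self):
        # Oldest observation first, newest last.
        return [value for frame in self.frames for value in frame]


# Illustrative values only: (cart position, pole angle) per step.
stacker = ObservationStacker(num_stack=3, obs_size=2)
first = stacker.reset([0.1, 0.02])
second = stacker.step([0.2, 0.03])
third = stacker.step([0.3, 0.05])
```

With a window like this, the agent can in principle infer velocities from the differences between consecutive positions and angles. The LSTM then adds a learned memory on top of that fixed window.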

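# Start from RLlib's default PPO config (ppo, ray, and stop come from earlier cells).
config3b2 = ppo.DEFAULT_CONFIG.copy()
# Train on the stacked, partially observable CartPole defined above.
config3b2["env"] = "StackedStatelessCartPole"
config3b2["model"] = {
    # Wrap the default model with an LSTM so the policy keeps a hidden state across steps.
    "use_lstm": True,
}

results3b2 = ray.tune.run("PPO", config=config3b2, stop=stop)
print("Option 3b2: Training finished successfully")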

== Status ==
Current time: 2021-12-01 23:29:25 (running for 00:00:00.15)
Memory usage on this node: 9.5/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 PENDING)

Trial name	status	loc
PPO_StackedStatelessCartPole_23a62_00000	PENDING

(pid=None) c:\users\stefan\git-repos\private\blog\venv\lib\site-packages\redis\connection.py:77: UserWarning: redis-py works best with hiredis. Please consider installing
(pid=None)   warnings.warn(msg)
(pid=10736) 2021-12-01 23:29:41,957 INFO trainer.py:753 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
(pid=10736) 2021-12-01 23:29:41,957 INFO ppo.py:166 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(pid=10736) 2021-12-01 23:29:41,958 INFO trainer.py:770 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=None) c:\users\stefan\git-repos\private\blog\venv\lib\site-packages\redis\connection.py:77: UserWarning: redis-py works best with hiredis. Please consider installing
(pid=None)   warnings.warn(msg)
(pid=None) c:\users\stefan\git-repos\private\blog\venv\lib\site-packages\redis\connection.py:77: UserWarning: redis-py works best with hiredis. Please consider installing
(pid=None)   warnings.warn(msg)
(pid=11688) c:\users\stefan\git-repos\private\blog\venv\lib\site-packages\gym\spaces\box.py:142: UserWarning: WARN: Casting input x to numpy array.
(pid=11688)   logger.warn("Casting input x to numpy array.")
(pid=19560) c:\users\stefan\git-repos\private\blog\venv\lib\site-packages\gym\spaces\box.py:142: UserWarning: WARN: Casting input x to numpy array.
(pid=19560)   logger.warn("Casting input x to numpy array.")
(pid=19560) 2021-12-01 23:29:54,372 WARNING deprecation.py:38 -- DeprecationWarning: `SampleBatch['is_training']` has been deprecated. Use `SampleBatch.is_training` instead. This will raise an error in the future!
(pid=10736) 2021-12-01 23:29:57,640 WARNING deprecation.py:38 -- DeprecationWarning: `SampleBatch['is_training']` has been deprecated. Use `SampleBatch.is_training` instead. This will raise an error in the future!
(pid=10736) 2021-12-01 23:29:59,755 WARNING trainer_template.py:185 -- `execution_plan` functions should accept `trainer`, `workers`, and `config` as args!
(pid=10736) 2021-12-01 23:29:59,755 INFO trainable.py:110 -- Trainable.setup took 17.805 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(pid=10736) 2021-12-01 23:29:59,755 WARNING util.py:57 -- Install gputil for GPU system monitoring.
(pid=10736) 2021-12-01 23:30:05,789 WARNING deprecation.py:38 -- DeprecationWarning: `slice` has been deprecated. Use `SampleBatch[start:stop]` instead. This will raise an error in the future!
(pid=10736) [2021-12-01 23:39:23,601 E 10736 16912] raylet_client.cc:159: IOError: Unknown error [RayletClient] Failed to disconnect from raylet.
(pid=10736) Windows fatal exception: access violation
(pid=10736) 
(pid=19560) [2021-12-01 23:39:23,630 C 19560 20340] core_worker.cc:796:  Check failed: _s.ok() Bad status: IOError: Unknown error
(pid=19560) *** StackTrace Information ***
(pid=19560)     PyInit__raylet
(pid=19560)     PyInit__raylet
(pid=19560)     PyInit__raylet
(pid=19560)     PyInit__raylet
(pid=19560)     PyInit__raylet
(pid=19560)     PyInit__raylet
(pid=19560)     PyInit__raylet
(pid=19560)     PyInit__raylet
(pid=19560)     PyInit__raylet
(pid=19560)     PyInit__raylet
(pid=19560)     PyInit__raylet
(pid=19560)     PyInit__raylet
(pid=19560)     PyInit__raylet
(pid=19560)     PyInit__raylet
(pid=19560)     PyInit__raylet
(pid=19560)     PyInit__raylet
(pid=19560)     PyNumber_InPlaceLshift
(pid=19560)     Py_CheckFunctionResult
(pid=19560)     PyEval_EvalFrameDefault
(pid=19560)     Py_CheckFunctionResult
(pid=19560)     PyEval_EvalFrameDefault
(pid=19560)     PyEval_EvalCodeWithName
(pid=19560)     PyEval_EvalCodeEx
(pid=19560)     PyEval_EvalCode
(pid=19560)     PyArena_New
(pid=19560)     PyArena_New
(pid=19560)     PyRun_FileExFlags
(pid=19560)     PyRun_SimpleFileExFlags
(pid=19560)     PyRun_AnyFileExFlags
(pid=19560)     Py_FatalError
(pid=19560)     Py_RunMain
(pid=19560)     Py_RunMain
(pid=19560)     Py_Main
(pid=19560)     BaseThreadInitThunk
(pid=19560)     RtlUserThreadStart
(pid=19560) 
(pid=11688) [2021-12-01 23:39:23,637 C 11688 12384] core_worker.cc:796:  Check failed: _s.ok() Bad status: IOError: Unknown error
(pid=11688) *** StackTrace Information ***
(pid=11688)     PyInit__raylet
(pid=11688)     PyInit__raylet
(pid=11688)     PyInit__raylet
(pid=11688)     PyInit__raylet
(pid=11688)     PyInit__raylet
(pid=11688)     PyInit__raylet
(pid=11688)     PyInit__raylet
(pid=11688)     PyInit__raylet
(pid=11688)     PyInit__raylet
(pid=11688)     PyInit__raylet
(pid=11688)     PyInit__raylet
(pid=11688)     PyInit__raylet
(pid=11688)     PyInit__raylet
(pid=11688)     PyInit__raylet
(pid=11688)     PyInit__raylet
(pid=11688)     PyInit__raylet
(pid=11688)     PyNumber_InPlaceLshift
(pid=11688)     Py_CheckFunctionResult
(pid=11688)     PyEval_EvalFrameDefault
(pid=11688)     Py_CheckFunctionResult
(pid=11688)     PyEval_EvalFrameDefault
(pid=11688)     PyEval_EvalCodeWithName
(pid=11688)     PyEval_EvalCodeEx
(pid=11688)     PyEval_EvalCode
(pid=11688)     PyArena_New
(pid=11688)     PyArena_New
(pid=11688)     PyRun_FileExFlags
(pid=11688)     PyRun_SimpleFileExFlags
(pid=11688)     PyRun_AnyFileExFlags
(pid=11688)     Py_FatalError
(pid=11688)     Py_RunMain
(pid=11688)     Py_RunMain
(pid=11688)     Py_Main
(pid=11688)     BaseThreadInitThunk
(pid=11688)     RtlUserThreadStart
(pid=11688) 
(pid=19560) Windows fatal exception: access violation
(pid=19560) 
(pid=19560) Stack (most recent call first):
(pid=19560)   File "c:\users\stefan\git-repos\private\blog\venv\lib\site-packages\ray\worker.py", line 425 in main_loop
(pid=19560)   File "c:\users\stefan\git-repos\private\blog\venv\lib\site-packages\ray\workers/default_worker.py", line 218 in <module>
(pid=11688) Windows fatal exception: access violation
(pid=11688) 
(pid=11688) Stack (most recent call first):
(pid=11688)   File "c:\users\stefan\git-repos\private\blog\venv\lib\site-packages\ray\worker.py", line 425 in main_loop
(pid=11688)   File "c:\users\stefan\git-repos\private\blog\venv\lib\site-packages\ray\workers/default_worker.py", line 218 in <module>
2021-12-01 23:39:23,731 INFO tune.py:630 -- Total run time: 598.23 seconds (597.55 seconds for the tuning loop).

== Status ==
Current time: 2021-12-01 23:29:59 (running for 00:00:34.23)
Memory usage on this node: 10.3/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StackedStatelessCartPole_23a62_00000	RUNNING	127.0.0.1:10736

Result for PPO_StackedStatelessCartPole_23a62_00000:
  agent_timesteps_total: 4000
  custom_metrics: {}
  date: 2021-12-01_23-30-56
  done: false
  episode_len_mean: 23.75
  episode_media: {}
  episode_reward_max: 76.0
  episode_reward_mean: 23.75
  episode_reward_min: 8.0
  episodes_this_iter: 168
  episodes_total: 168
  experiment_id: a83f6e57239f4aa1a70a247399bd5e70
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.67047119140625
          entropy_coeff: 0.0
          kl: 0.01581866294145584
          model: {}
          policy_loss: -0.01895357482135296
          total_loss: 154.9998321533203
          vf_explained_var: -0.10604370385408401
          vf_loss: 155.015625
    num_agent_steps_sampled: 4000
    num_agent_steps_trained: 4000
    num_steps_sampled: 4000
    num_steps_trained: 4000
  iterations_since_restore: 1
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 92.95384615384614
    ram_util_percent: 86.28846153846153
  pid: 10736
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.12315661179645192
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.14159590390983534
    mean_inference_ms: 2.4392604373977185
    mean_raw_obs_processing_ms: 0.21398437689550173
  time_since_restore: 56.266863107681274
  time_this_iter_s: 56.266863107681274
  time_total_s: 56.266863107681274
  timers:
    learn_throughput: 79.628
    learn_time_ms: 50233.338
    load_throughput: 0.0
    load_time_ms: 0.0
    sample_throughput: 662.919
    sample_time_ms: 6033.918
    update_time_ms: 0.0
  timestamp: 1638397856
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 4000
  training_iteration: 1
  trial_id: 23a62_00000
  
Result for PPO_StackedStatelessCartPole_23a62_00000:
  agent_timesteps_total: 8000
  custom_metrics: {}
  date: 2021-12-01_23-31-51
  done: false
  episode_len_mean: 26.83783783783784
  episode_media: {}
  episode_reward_max: 99.0
  episode_reward_mean: 26.83783783783784
  episode_reward_min: 9.0
  episodes_this_iter: 148
  episodes_total: 316
  experiment_id: a83f6e57239f4aa1a70a247399bd5e70
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6504567265510559
          entropy_coeff: 0.0
          kl: 0.008978299796581268
          model: {}
          policy_loss: -0.003679021494463086
          total_loss: 127.99153137207031
          vf_explained_var: 0.08064287155866623
          vf_loss: 127.99342346191406
    num_agent_steps_sampled: 8000
    num_agent_steps_trained: 8000
    num_steps_sampled: 8000
    num_steps_trained: 8000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 2
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 93.6051948051948
    ram_util_percent: 85.9285714285714
  pid: 10736
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.11010479833490981
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.1344388731949505
    mean_inference_ms: 2.3811965957541155
    mean_raw_obs_processing_ms: 0.23579910601161452
  time_since_restore: 111.88376545906067
  time_this_iter_s: 55.616902351379395
  time_total_s: 111.88376545906067
  timers:
    learn_throughput: 79.946
    learn_time_ms: 50033.46
    load_throughput: 0.0
    load_time_ms: 0.0
    sample_throughput: 128.91
    sample_time_ms: 31029.425
    update_time_ms: 7.819
  timestamp: 1638397911
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 8000
  training_iteration: 2
  trial_id: 23a62_00000
  
Result for PPO_StackedStatelessCartPole_23a62_00000:
  agent_timesteps_total: 12000
  custom_metrics: {}
  date: 2021-12-01_23-32-46
  done: false
  episode_len_mean: 31.140625
  episode_media: {}
  episode_reward_max: 84.0
  episode_reward_mean: 31.140625
  episode_reward_min: 9.0
  episodes_this_iter: 128
  episodes_total: 444
  experiment_id: a83f6e57239f4aa1a70a247399bd5e70
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6188111901283264
          entropy_coeff: 0.0
          kl: 0.01211484707891941
          model: {}
          policy_loss: -0.0035639703273773193
          total_loss: 120.47854614257812
          vf_explained_var: 0.1551436185836792
          vf_loss: 120.47969055175781
    num_agent_steps_sampled: 12000
    num_agent_steps_trained: 12000
    num_steps_sampled: 12000
    num_steps_trained: 12000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 3
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 93.38815789473684
    ram_util_percent: 85.94868421052632
  pid: 10736
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.11705200286720863
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.1417983632483218
    mean_inference_ms: 2.3423456092602697
    mean_raw_obs_processing_ms: 0.24562951880700387
  time_since_restore: 167.125750541687
  time_this_iter_s: 55.24198508262634
  time_total_s: 167.125750541687
  timers:
    learn_throughput: 80.241
    learn_time_ms: 49850.087
    load_throughput: 0.0
    load_time_ms: 0.0
    sample_throughput: 101.955
    sample_time_ms: 39232.888
    update_time_ms: 6.891
  timestamp: 1638397966
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 12000
  training_iteration: 3
  trial_id: 23a62_00000
  
Result for PPO_StackedStatelessCartPole_23a62_00000:
  agent_timesteps_total: 16000
  custom_metrics: {}
  date: 2021-12-01_23-33-42
  done: false
  episode_len_mean: 30.353383458646615
  episode_media: {}
  episode_reward_max: 90.0
  episode_reward_mean: 30.353383458646615
  episode_reward_min: 10.0
  episodes_this_iter: 133
  episodes_total: 577
  experiment_id: a83f6e57239f4aa1a70a247399bd5e70
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6017324328422546
          entropy_coeff: 0.0
          kl: 0.01713641546666622
          model: {}
          policy_loss: -0.01428857073187828
          total_loss: 144.1490936279297
          vf_explained_var: 0.12213249504566193
          vf_loss: 144.15994262695312
    num_agent_steps_sampled: 16000
    num_agent_steps_trained: 16000
    num_steps_sampled: 16000
    num_steps_trained: 16000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 4
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 94.54285714285712
    ram_util_percent: 85.35714285714288
  pid: 10736
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.11847087716647563
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.1291269029294712
    mean_inference_ms: 2.3499539409316808
    mean_raw_obs_processing_ms: 0.24600572504675797
  time_since_restore: 222.71135187149048
  time_this_iter_s: 55.58560132980347
  time_total_s: 222.71135187149048
  timers:
    learn_throughput: 80.284
    learn_time_ms: 49823.259
    load_throughput: 0.0
    load_time_ms: 0.0
    sample_throughput: 92.442
    sample_time_ms: 43270.364
    update_time_ms: 6.418
  timestamp: 1638398022
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 16000
  training_iteration: 4
  trial_id: 23a62_00000
  
Result for PPO_StackedStatelessCartPole_23a62_00000:
  agent_timesteps_total: 20000
  custom_metrics: {}
  date: 2021-12-01_23-34-39
  done: false
  episode_len_mean: 32.04032258064516
  episode_media: {}
  episode_reward_max: 95.0
  episode_reward_mean: 32.04032258064516
  episode_reward_min: 9.0
  episodes_this_iter: 124
  episodes_total: 701
  experiment_id: a83f6e57239f4aa1a70a247399bd5e70
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.5604901909828186
          entropy_coeff: 0.0
          kl: 0.009820478968322277
          model: {}
          policy_loss: -0.0032165604643523693
          total_loss: 100.9627914428711
          vf_explained_var: 0.21908660233020782
          vf_loss: 100.96404266357422
    num_agent_steps_sampled: 20000
    num_agent_steps_trained: 20000
    num_steps_sampled: 20000
    num_steps_trained: 20000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 5
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 95.19487179487179
    ram_util_percent: 85.38333333333334
  pid: 10736
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.11550202333653888
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.12803896103049967
    mean_inference_ms: 2.3503844656975446
    mean_raw_obs_processing_ms: 0.24389184124832392
  time_since_restore: 279.80817222595215
  time_this_iter_s: 57.09682035446167
  time_total_s: 279.80817222595215
  timers:
    learn_throughput: 79.803
    learn_time_ms: 50123.572
    load_throughput: 0.0
    load_time_ms: 0.0
    sample_throughput: 87.481
    sample_time_ms: 45724.201
    update_time_ms: 6.735
  timestamp: 1638398079
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 20000
  training_iteration: 5
  trial_id: 23a62_00000
  
Result for PPO_StackedStatelessCartPole_23a62_00000:
  agent_timesteps_total: 24000
  custom_metrics: {}
  date: 2021-12-01_23-35-37
  done: false
  episode_len_mean: 34.64655172413793
  episode_media: {}
  episode_reward_max: 73.0
  episode_reward_mean: 34.64655172413793
  episode_reward_min: 9.0
  episodes_this_iter: 116
  episodes_total: 817
  experiment_id: a83f6e57239f4aa1a70a247399bd5e70
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.5657119154930115
          entropy_coeff: 0.0
          kl: 0.013083796948194504
          model: {}
          policy_loss: -0.002956786658614874
          total_loss: 113.45106506347656
          vf_explained_var: 0.19632089138031006
          vf_loss: 113.45140075683594
    num_agent_steps_sampled: 24000
    num_agent_steps_trained: 24000
    num_steps_sampled: 24000
    num_steps_trained: 24000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 6
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 96.2126582278481
    ram_util_percent: 85.9253164556962
  pid: 10736
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.11478686430048778
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.1277210771833874
    mean_inference_ms: 2.3750658085342033
    mean_raw_obs_processing_ms: 0.2452229849547805
  time_since_restore: 337.58967638015747
  time_this_iter_s: 57.78150415420532
  time_total_s: 337.58967638015747
  timers:
    learn_throughput: 79.396
    learn_time_ms: 50380.468
    load_throughput: 0.0
    load_time_ms: 0.0
    sample_throughput: 83.877
    sample_time_ms: 47688.802
    update_time_ms: 6.779
  timestamp: 1638398137
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 24000
  training_iteration: 6
  trial_id: 23a62_00000
  
Result for PPO_StackedStatelessCartPole_23a62_00000:
  agent_timesteps_total: 28000
  custom_metrics: {}
  date: 2021-12-01_23-36-34
  done: false
  episode_len_mean: 33.652542372881356
  episode_media: {}
  episode_reward_max: 80.0
  episode_reward_mean: 33.652542372881356
  episode_reward_min: 10.0
  episodes_this_iter: 118
  episodes_total: 935
  experiment_id: a83f6e57239f4aa1a70a247399bd5e70
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.55117267370224
          entropy_coeff: 0.0
          kl: 0.007611136883497238
          model: {}
          policy_loss: 0.007003166247159243
          total_loss: 101.61392211914062
          vf_explained_var: 0.2326377034187317
          vf_loss: 101.60539245605469
    num_agent_steps_sampled: 28000
    num_agent_steps_trained: 28000
    num_steps_sampled: 28000
    num_steps_trained: 28000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 7
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 97.20759493670884
    ram_util_percent: 83.11772151898732
  pid: 10736
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.1171131524330598
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.13283579428360615
    mean_inference_ms: 2.437068269845546
    mean_raw_obs_processing_ms: 0.24942217294799132
  time_since_restore: 394.7531487941742
  time_this_iter_s: 57.163472414016724
  time_total_s: 394.7531487941742
  timers:
    learn_throughput: 79.423
    learn_time_ms: 50363.0
    load_throughput: 0.0
    load_time_ms: 0.0
    sample_throughput: 81.207
    sample_time_ms: 49256.55
    update_time_ms: 5.81
  timestamp: 1638398194
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 28000
  training_iteration: 7
  trial_id: 23a62_00000
  
Result for PPO_StackedStatelessCartPole_23a62_00000:
  agent_timesteps_total: 32000
  custom_metrics: {}
  date: 2021-12-01_23-37-31
  done: false
  episode_len_mean: 34.30769230769231
  episode_media: {}
  episode_reward_max: 83.0
  episode_reward_mean: 34.30769230769231
  episode_reward_min: 9.0
  episodes_this_iter: 117
  episodes_total: 1052
  experiment_id: a83f6e57239f4aa1a70a247399bd5e70
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.5451725721359253
          entropy_coeff: 0.0
          kl: 0.009232791140675545
          model: {}
          policy_loss: -0.004543236922472715
          total_loss: 98.80670928955078
          vf_explained_var: 0.2929261326789856
          vf_loss: 98.80941772460938
    num_agent_steps_sampled: 32000
    num_agent_steps_trained: 32000
    num_steps_sampled: 32000
    num_steps_trained: 32000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 8
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 95.07692307692308
    ram_util_percent: 83.00512820512823
  pid: 10736
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.11435559950687862
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.12772642169718973
    mean_inference_ms: 2.4294528410907237
    mean_raw_obs_processing_ms: 0.2475755621430507
  time_since_restore: 450.92096877098083
  time_this_iter_s: 56.16781997680664
  time_total_s: 450.92096877098083
  timers:
    learn_throughput: 79.407
    learn_time_ms: 50373.586
    load_throughput: 0.0
    load_time_ms: 0.0
    sample_throughput: 79.838
    sample_time_ms: 50101.286
    update_time_ms: 5.584
  timestamp: 1638398251
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 32000
  training_iteration: 8
  trial_id: 23a62_00000
  
Result for PPO_StackedStatelessCartPole_23a62_00000:
  agent_timesteps_total: 36000
  custom_metrics: {}
  date: 2021-12-01_23-38-27
  done: false
  episode_len_mean: 39.45544554455446
  episode_media: {}
  episode_reward_max: 88.0
  episode_reward_mean: 39.45544554455446
  episode_reward_min: 10.0
  episodes_this_iter: 101
  episodes_total: 1153
  experiment_id: a83f6e57239f4aa1a70a247399bd5e70
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.5534942746162415
          entropy_coeff: 0.0
          kl: 0.008644542656838894
          model: {}
          policy_loss: 0.000738372968044132
          total_loss: 82.19181823730469
          vf_explained_var: 0.3500000238418579
          vf_loss: 82.18933868408203
    num_agent_steps_sampled: 36000
    num_agent_steps_trained: 36000
    num_steps_sampled: 36000
    num_steps_trained: 36000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 9
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 95.58076923076923
    ram_util_percent: 82.91666666666666
  pid: 10736
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.1135820346886012
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.12853882753117699
    mean_inference_ms: 2.4140773796778965
    mean_raw_obs_processing_ms: 0.24722175705109586
  time_since_restore: 507.0870122909546
  time_this_iter_s: 56.166043519973755
  time_total_s: 507.0870122909546
  timers:
    learn_throughput: 79.385
    learn_time_ms: 50387.098
    load_throughput: 0.0
    load_time_ms: 0.0
    sample_throughput: 78.782
    sample_time_ms: 50772.981
    update_time_ms: 6.076
  timestamp: 1638398307
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 36000
  training_iteration: 9
  trial_id: 23a62_00000
  
Result for PPO_StackedStatelessCartPole_23a62_00000:
  agent_timesteps_total: 40000
  custom_metrics: {}
  date: 2021-12-01_23-39-23
  done: true
  episode_len_mean: 36.67272727272727
  episode_media: {}
  episode_reward_max: 80.0
  episode_reward_mean: 36.67272727272727
  episode_reward_min: 11.0
  episodes_this_iter: 110
  episodes_total: 1263
  experiment_id: a83f6e57239f4aa1a70a247399bd5e70
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.5353969931602478
          entropy_coeff: 0.0
          kl: 0.007495984435081482
          model: {}
          policy_loss: -0.003161693923175335
          total_loss: 78.63844299316406
          vf_explained_var: 0.3598953187465668
          vf_loss: 78.64009857177734
    num_agent_steps_sampled: 40000
    num_agent_steps_trained: 40000
    num_steps_sampled: 40000
    num_steps_trained: 40000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 10
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 94.4233766233766
    ram_util_percent: 82.21298701298701
  pid: 10736
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.11409118320273612
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.13058247747347188
    mean_inference_ms: 2.398278065616303
    mean_raw_obs_processing_ms: 0.2469977443652814
  time_since_restore: 562.8484704494476
  time_this_iter_s: 55.76145815849304
  time_total_s: 562.8484704494476
  timers:
    learn_throughput: 79.43
    learn_time_ms: 50358.561
    load_throughput: 0.0
    load_time_ms: 0.0
    sample_throughput: 77.946
    sample_time_ms: 51317.629
    update_time_ms: 5.469
  timestamp: 1638398363
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 40000
  training_iteration: 10
  trial_id: 23a62_00000
  
Option 3b2: Training finished successfully

== Status ==
Current time: 2021-12-01 23:30:59 (running for 00:01:33.56)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_23a62_00000	RUNNING	127.0.0.1:10736	1	56.2669	4000	23.75	76	8	23.75

== Status ==
Current time: 2021-12-01 23:31:55 (running for 00:02:30.26)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_23a62_00000	RUNNING	127.0.0.1:10736	2	111.884	8000	26.8378	99	9	26.8378

== Status ==
Current time: 2021-12-01 23:32:49 (running for 00:03:23.49)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_23a62_00000	RUNNING	127.0.0.1:10736	3	167.126	12000	31.1406	84	9	31.1406

== Status ==
Current time: 2021-12-01 23:33:45 (running for 00:04:20.12)
Memory usage on this node: 10.1/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_23a62_00000	RUNNING	127.0.0.1:10736	4	222.711	16000	30.3534	90	10	30.3534

== Status ==
Current time: 2021-12-01 23:34:42 (running for 00:05:17.29)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_23a62_00000	RUNNING	127.0.0.1:10736	5	279.808	20000	32.0403	95	9	32.0403

== Status ==
Current time: 2021-12-01 23:35:40 (running for 00:06:15.17)
Memory usage on this node: 10.2/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_23a62_00000	RUNNING	127.0.0.1:10736	6	337.59	24000	34.6466	73	9	34.6466

== Status ==
Current time: 2021-12-01 23:36:38 (running for 00:07:13.36)
Memory usage on this node: 9.9/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_23a62_00000	RUNNING	127.0.0.1:10736	7	394.753	28000	33.6525	80	10	33.6525

== Status ==
Current time: 2021-12-01 23:37:36 (running for 00:08:10.57)
Memory usage on this node: 9.9/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_23a62_00000	RUNNING	127.0.0.1:10736	8	450.921	32000	34.3077	83	9	34.3077

== Status ==
Current time: 2021-12-01 23:38:28 (running for 00:09:02.79)
Memory usage on this node: 9.8/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_23a62_00000	RUNNING	127.0.0.1:10736	9	507.087	36000	39.4554	88	10	39.4554

== Status ==
Current time: 2021-12-01 23:39:04 (running for 00:09:38.60)
Memory usage on this node: 9.8/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_23a62_00000	RUNNING	127.0.0.1:10736	9	507.087	36000	39.4554	88	10	39.4554

== Status ==
Current time: 2021-12-01 23:39:09 (running for 00:09:43.70)
Memory usage on this node: 9.8/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_23a62_00000	RUNNING	127.0.0.1:10736	9	507.087	36000	39.4554	88	10	39.4554

== Status ==
Current time: 2021-12-01 23:39:14 (running for 00:09:48.72)
Memory usage on this node: 9.8/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_23a62_00000	RUNNING	127.0.0.1:10736	9	507.087	36000	39.4554	88	10	39.4554

== Status ==
Current time: 2021-12-01 23:39:19 (running for 00:09:53.85)
Memory usage on this node: 9.8/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_23a62_00000	RUNNING	127.0.0.1:10736	9	507.087	36000	39.4554	88	10	39.4554

== Status ==
Current time: 2021-12-01 23:39:23 (running for 00:09:57.60)
Memory usage on this node: 9.8/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 TERMINATED)

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StackedStatelessCartPole_23a62_00000	TERMINATED	127.0.0.1:10736	10	562.848	40000	36.6727	80	11	36.6727

print_reward(results3b2)

Reward after 10 training iterations: 36.67272727272727

plot_rewards(results3b2)

plot_learning(results3a, label="3a: Stacked, Partial Observations")
plot_learning(results3b, label="3b: LSTM")
plot_learning(results3b2, label="3b2: LSTM + Stacking")

Option 3c: Use Attention for Processing the Sequence

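Self-attention is a more recent and popular alternative to RNNs for processing sequential data. Instead of passing a hidden state from one step to the next, it lets each element of the sequence weight every other element directly. The transformer architecture, which is built on self-attention, is currently state of the art for many natural language processing (NLP) tasks.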
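To make the idea more concrete, here is a minimal sketch of single-head, scaled dot-product self-attention over a short sequence of partial CartPole observations. It is written in plain Python for illustration only and is not how RLlib implements its attention net. The helper names (`softmax`, `project`, `self_attention`, `random_matrix`), the random projection matrices, and the example observation values are all assumptions made for this sketch. It also leaves out multiple heads, positional information, and memory.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project(vector, matrix):
    """Multiply a vector of length n by a matrix given as n rows of m columns."""
    return [
        sum(v * row[j] for v, row in zip(vector, matrix))
        for j in range(len(matrix[0]))
    ]


def self_attention(sequence, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative sketch)."""
    queries = [project(x, w_q) for x in sequence]
    keys = [project(x, w_k) for x in sequence]
    values = [project(x, w_v) for x in sequence]
    scale = math.sqrt(len(keys[0]))

    outputs, all_weights = [], []
    for q in queries:
        # Compare this step's query with the key of every step in the sequence
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        all_weights.append(weights)
        # The output is the attention-weighted average of all value vectors
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs, all_weights


def random_matrix(rows, cols, rng):
    """Random projection matrix; in a real model these weights are learned."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
obs_dim, attn_dim = 2, 4
# Hypothetical sequence of four partial observations: (cart position, pole angle)
observations = [[0.00, 0.02], [0.01, 0.03], [0.03, 0.05], [0.06, 0.08]]
w_q = random_matrix(obs_dim, attn_dim, rng)
w_k = random_matrix(obs_dim, attn_dim, rng)
w_v = random_matrix(obs_dim, attn_dim, rng)

outputs, weights = self_attention(observations, w_q, w_k, w_v)
print(len(outputs), len(outputs[0]))  # one attn_dim-sized vector per time step
```

Each row of `weights` shows how strongly one time step attends to every step in the sequence. Because every output mixes information from several observations, a model on top of it can in principle recover quantities such as velocity that a single snapshot does not contain.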

A similar, slightly modified attention architecture is also useful for RL (see related paper). As with the LSTM, enabling attention in RLlib only requires setting the corresponding flag (use_attention) in the model config:

#collapse-output

# Start from the default PPO config and train on the partially observable environment
config3c = ppo.DEFAULT_CONFIG.copy()
config3c["env"] = "StatelessCartPole"
config3c["model"] = {
    # Attention net wrapping (for tf) can already use the native keras
    # model versions. For torch, this will have no effect.
    "_use_default_native_models": True,
    # Wrap the default model with RLlib's attention net
    "use_attention": True,
    # Optional attention hyperparameters; uncomment to override RLlib's defaults
    # "max_seq_len": 10,
    # "attention_num_transformer_units": 1,
    # "attention_dim": 32,
    # "attention_memory_inference": 10,
    # "attention_memory_training": 10,
    # "attention_num_heads": 1,
    # "attention_head_dim": 32,
    # "attention_position_wise_mlp_dim": 32,
}

results3c = ray.tune.run("PPO", config=config3c, stop=stop)
# Only reached if ray.tune.run returns without raising an error
print("Option 3c: Training finished successfully")

== Status ==
Current time: 2021-12-01 23:39:24 (running for 00:00:00.15)
Memory usage on this node: 8.9/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 PENDING)

Trial name	status	loc
PPO_StatelessCartPole_88823_00000	PENDING

(pid=None) c:\users\stefan\git-repos\private\blog\venv\lib\site-packages\redis\connection.py:77: UserWarning: redis-py works best with hiredis. Please consider installing
(pid=None)   warnings.warn(msg)
(pid=12464) 2021-12-01 23:39:35,779 INFO trainer.py:753 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
(pid=12464) 2021-12-01 23:39:35,780 INFO ppo.py:166 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(pid=12464) 2021-12-01 23:39:35,780 INFO trainer.py:770 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=None) c:\users\stefan\git-repos\private\blog\venv\lib\site-packages\redis\connection.py:77: UserWarning: redis-py works best with hiredis. Please consider installing
(pid=None)   warnings.warn(msg)
(pid=None) c:\users\stefan\git-repos\private\blog\venv\lib\site-packages\redis\connection.py:77: UserWarning: redis-py works best with hiredis. Please consider installing
(pid=None)   warnings.warn(msg)
(pid=12464) 2021-12-01 23:40:01,112 WARNING trainer_template.py:185 -- `execution_plan` functions should accept `trainer`, `workers`, and `config` as args!
(pid=12464) 2021-12-01 23:40:01,115 INFO trainable.py:110 -- Trainable.setup took 25.341 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(pid=12464) 2021-12-01 23:40:01,117 WARNING util.py:57 -- Install gputil for GPU system monitoring.
(pid=12464) 2021-12-01 23:40:08,500 WARNING deprecation.py:38 -- DeprecationWarning: `slice` has been deprecated. Use `SampleBatch[start:stop]` instead. This will raise an error in the future!
(pid=12464) [2021-12-01 23:43:32,273 E 12464 11260] raylet_client.cc:159: IOError: Unknown error [RayletClient] Failed to disconnect from raylet.
(pid=12464) Windows fatal exception: access violation
(pid=12464) 
(pid=15896) [2021-12-01 23:43:32,278 E 15896 15972] raylet_client.cc:159: IOError: Unknown error [RayletClient] Failed to disconnect from raylet.
(pid=15896) Windows fatal exception: access violation
(pid=15896) 
(pid=17956) [2021-12-01 23:43:32,288 C 17956 2592] core_worker.cc:796:  Check failed: _s.ok() Bad status: IOError: Unknown error
(pid=17956) *** StackTrace Information ***
(pid=17956)     PyInit__raylet
(pid=17956)     PyInit__raylet
(pid=17956)     PyInit__raylet
(pid=17956)     PyInit__raylet
(pid=17956)     PyInit__raylet
(pid=17956)     PyInit__raylet
(pid=17956)     PyInit__raylet
(pid=17956)     PyInit__raylet
(pid=17956)     PyInit__raylet
(pid=17956)     PyInit__raylet
(pid=17956)     PyInit__raylet
(pid=17956)     PyInit__raylet
(pid=17956)     PyInit__raylet
(pid=17956)     PyInit__raylet
(pid=17956)     PyInit__raylet
(pid=17956)     PyInit__raylet
(pid=17956)     PyNumber_InPlaceLshift
(pid=17956)     Py_CheckFunctionResult
(pid=17956)     PyEval_EvalFrameDefault
(pid=17956)     Py_CheckFunctionResult
(pid=17956)     PyEval_EvalFrameDefault
(pid=17956)     PyEval_EvalCodeWithName
(pid=17956)     PyEval_EvalCodeEx
(pid=17956)     PyEval_EvalCode
(pid=17956)     PyArena_New
(pid=17956)     PyArena_New
(pid=17956)     PyRun_FileExFlags
(pid=17956)     PyRun_SimpleFileExFlags
(pid=17956)     PyRun_AnyFileExFlags
(pid=17956)     Py_FatalError
(pid=17956)     Py_RunMain
(pid=17956)     Py_RunMain
(pid=17956)     Py_Main
(pid=17956)     BaseThreadInitThunk
(pid=17956)     RtlUserThreadStart
(pid=17956) 
(pid=17956) Windows fatal exception: access violation
(pid=17956) 
(pid=17956) Stack (most recent call first):
(pid=17956)   File "c:\users\stefan\git-repos\private\blog\venv\lib\site-packages\ray\worker.py", line 425 in main_loop
(pid=17956)   File "c:\users\stefan\git-repos\private\blog\venv\lib\site-packages\ray\workers/default_worker.py", line 218 in <module>
2021-12-01 23:43:32,411 INFO tune.py:630 -- Total run time: 248.18 seconds (247.76 seconds for the tuning loop).

== Status ==
Current time: 2021-12-01 23:39:29 (running for 00:00:05.16)
Memory usage on this node: 8.9/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 PENDING)

Trial name	status	loc
PPO_StatelessCartPole_88823_00000	PENDING

== Status ==
Current time: 2021-12-01 23:40:01 (running for 00:00:36.89)
Memory usage on this node: 9.9/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 RUNNING)

Trial name	status	loc
PPO_StatelessCartPole_88823_00000	RUNNING	127.0.0.1:12464

Result for PPO_StatelessCartPole_88823_00000:
  agent_timesteps_total: 4000
  custom_metrics: {}
  date: 2021-12-01_23-40-21
  done: false
  episode_len_mean: 23.939393939393938
  episode_media: {}
  episode_reward_max: 76.0
  episode_reward_mean: 23.939393939393938
  episode_reward_min: 10.0
  episodes_this_iter: 165
  episodes_total: 165
  experiment_id: 7e1494afa3414dd998e7cde489d370fd
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6663342118263245
          entropy_coeff: 0.0
          kl: 0.01937255822122097
          policy_loss: -0.013014732860028744
          total_loss: 154.01339721679688
          vf_explained_var: 0.006343733984977007
          vf_loss: 154.0225372314453
    num_agent_steps_sampled: 4000
    num_agent_steps_trained: 4000
    num_steps_sampled: 4000
    num_steps_trained: 4000
  iterations_since_restore: 1
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 70.58666666666666
    ram_util_percent: 83.49666666666667
  pid: 12464
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10596248134570115
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.12798089296365409
    mean_inference_ms: 3.0966986549503033
    mean_raw_obs_processing_ms: 0.1998490762828277
  time_since_restore: 20.76484441757202
  time_this_iter_s: 20.76484441757202
  time_total_s: 20.76484441757202
  timers:
    learn_throughput: 299.129
    learn_time_ms: 13372.16
    load_throughput: 0.0
    load_time_ms: 0.0
    sample_throughput: 541.375
    sample_time_ms: 7388.589
    update_time_ms: 4.002
  timestamp: 1638398421
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 4000
  training_iteration: 1
  trial_id: '88823_00000'
  
Result for PPO_StatelessCartPole_88823_00000:
  agent_timesteps_total: 8000
  custom_metrics: {}
  date: 2021-12-01_23-40-40
  done: false
  episode_len_mean: 27.791666666666668
  episode_media: {}
  episode_reward_max: 122.0
  episode_reward_mean: 27.791666666666668
  episode_reward_min: 9.0
  episodes_this_iter: 144
  episodes_total: 309
  experiment_id: 7e1494afa3414dd998e7cde489d370fd
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6823218464851379
          entropy_coeff: 0.0
          kl: 0.023256205022335052
          policy_loss: 0.006493973080068827
          total_loss: 161.61268615722656
          vf_explained_var: 0.0007292712107300758
          vf_loss: 161.60154724121094
    num_agent_steps_sampled: 8000
    num_agent_steps_trained: 8000
    num_steps_sampled: 8000
    num_steps_trained: 8000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 2
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 73.93076923076923
    ram_util_percent: 83.53846153846153
  pid: 12464
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.11035390714468313
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.11674551834165348
    mean_inference_ms: 3.0091454225371876
    mean_raw_obs_processing_ms: 0.1931061975500896
  time_since_restore: 39.707205057144165
  time_this_iter_s: 18.942360639572144
  time_total_s: 39.707205057144165
  timers:
    learn_throughput: 313.757
    learn_time_ms: 12748.702
    load_throughput: 0.0
    load_time_ms: 0.0
    sample_throughput: 289.977
    sample_time_ms: 13794.207
    update_time_ms: 4.002
  timestamp: 1638398440
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 8000
  training_iteration: 2
  trial_id: '88823_00000'
  
Result for PPO_StatelessCartPole_88823_00000:
  agent_timesteps_total: 12000
  custom_metrics: {}
  date: 2021-12-01_23-40-59
  done: false
  episode_len_mean: 23.017142857142858
  episode_media: {}
  episode_reward_max: 66.0
  episode_reward_mean: 23.017142857142858
  episode_reward_min: 8.0
  episodes_this_iter: 175
  episodes_total: 484
  experiment_id: 7e1494afa3414dd998e7cde489d370fd
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 4.999999873689376e-05
          entropy: 0.6649836897850037
          entropy_coeff: 0.0
          kl: 0.018709277734160423
          policy_loss: -0.007455786690115929
          total_loss: 69.26802062988281
          vf_explained_var: -0.04406118765473366
          vf_loss: 69.26985931396484
    num_agent_steps_sampled: 12000
    num_agent_steps_trained: 12000
    num_steps_sampled: 12000
    num_steps_trained: 12000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 3
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 70.76666666666667
    ram_util_percent: 83.59629629629629
  pid: 12464
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10152428396558573
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.11275608464037909
    mean_inference_ms: 2.987558953459597
    mean_raw_obs_processing_ms: 0.18824034213713292
  time_since_restore: 58.64493227005005
  time_this_iter_s: 18.937727212905884
  time_total_s: 58.64493227005005
  timers:
    learn_throughput: 318.575
    learn_time_ms: 12555.928
    load_throughput: 0.0
    load_time_ms: 0.0
    sample_throughput: 257.773
    sample_time_ms: 15517.557
    update_time_ms: 2.668
  timestamp: 1638398459
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 12000
  training_iteration: 3
  trial_id: '88823_00000'
  
Result for PPO_StatelessCartPole_88823_00000:
  agent_timesteps_total: 16000
  custom_metrics: {}
  date: 2021-12-01_23-41-20
  done: false
  episode_len_mean: 27.27891156462585
  episode_media: {}
  episode_reward_max: 65.0
  episode_reward_mean: 27.27891156462585
  episode_reward_min: 10.0
  episodes_this_iter: 147
  episodes_total: 631
  experiment_id: 7e1494afa3414dd998e7cde489d370fd
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 4.999999873689376e-05
          entropy: 0.666556715965271
          entropy_coeff: 0.0
          kl: 0.021740607917308807
          policy_loss: 0.0014071379555389285
          total_loss: 86.13334655761719
          vf_explained_var: -0.020127560943365097
          vf_loss: 86.12541961669922
    num_agent_steps_sampled: 16000
    num_agent_steps_trained: 16000
    num_steps_sampled: 16000
    num_steps_trained: 16000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 4
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 77.01379310344828
    ram_util_percent: 83.93103448275863
  pid: 12464
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.10143838749352374
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.10536623976424975
    mean_inference_ms: 2.9572038509498433
    mean_raw_obs_processing_ms: 0.2044611579055872
  time_since_restore: 79.04436016082764
  time_this_iter_s: 20.399427890777588
  time_total_s: 79.04436016082764
  timers:
    learn_throughput: 311.522
    learn_time_ms: 12840.182
    load_throughput: 0.0
    load_time_ms: 0.0
    sample_throughput: 244.428
    sample_time_ms: 16364.753
    update_time_ms: 3.002
  timestamp: 1638398480
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 16000
  training_iteration: 4
  trial_id: '88823_00000'
  
Result for PPO_StatelessCartPole_88823_00000:
  agent_timesteps_total: 20000
  custom_metrics: {}
  date: 2021-12-01_23-41-45
  done: false
  episode_len_mean: 23.446428571428573
  episode_media: {}
  episode_reward_max: 102.0
  episode_reward_mean: 23.446428571428573
  episode_reward_min: 9.0
  episodes_this_iter: 168
  episodes_total: 799
  experiment_id: 7e1494afa3414dd998e7cde489d370fd
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 4.999999873689376e-05
          entropy: 0.6605092287063599
          entropy_coeff: 0.0
          kl: 0.019639955833554268
          policy_loss: -0.012088990770280361
          total_loss: 87.5208740234375
          vf_explained_var: -0.08362725377082825
          vf_loss: 87.52411651611328
    num_agent_steps_sampled: 20000
    num_agent_steps_trained: 20000
    num_steps_sampled: 20000
    num_steps_trained: 20000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 5
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 89.04
    ram_util_percent: 84.47428571428571
  pid: 12464
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.1082024611015279
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.11219990467946991
    mean_inference_ms: 3.1819671927268045
    mean_raw_obs_processing_ms: 0.2181941025366806
  time_since_restore: 104.5215425491333
  time_this_iter_s: 25.477182388305664
  time_total_s: 104.5215425491333
  timers:
    learn_throughput: 297.013
    learn_time_ms: 13467.402
    load_throughput: 19987152.728
    load_time_ms: 0.2
    sample_throughput: 225.522
    sample_time_ms: 17736.596
    update_time_ms: 2.401
  timestamp: 1638398505
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 20000
  training_iteration: 5
  trial_id: '88823_00000'
  
Result for PPO_StatelessCartPole_88823_00000:
  agent_timesteps_total: 24000
  custom_metrics: {}
  date: 2021-12-01_23-42-12
  done: false
  episode_len_mean: 28.964285714285715
  episode_media: {}
  episode_reward_max: 94.0
  episode_reward_mean: 28.964285714285715
  episode_reward_min: 9.0
  episodes_this_iter: 140
  episodes_total: 939
  experiment_id: 7e1494afa3414dd998e7cde489d370fd
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 4.999999873689376e-05
          entropy: 0.6493598818778992
          entropy_coeff: 0.0
          kl: 0.008863288909196854
          policy_loss: -0.009960012510418892
          total_loss: 130.17213439941406
          vf_explained_var: -0.04542897269129753
          vf_loss: 130.1781005859375
    num_agent_steps_sampled: 24000
    num_agent_steps_trained: 24000
    num_steps_sampled: 24000
    num_steps_trained: 24000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 6
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 87.96944444444443
    ram_util_percent: 84.64722222222221
  pid: 12464
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.12465831386460766
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.12816464580125522
    mean_inference_ms: 3.389497130627828
    mean_raw_obs_processing_ms: 0.2190003983187485
  time_since_restore: 130.94173955917358
  time_this_iter_s: 26.420197010040283
  time_total_s: 130.94173955917358
  timers:
    learn_throughput: 287.926
    learn_time_ms: 13892.464
    load_throughput: 11985152.518
    load_time_ms: 0.334
    sample_throughput: 208.532
    sample_time_ms: 19181.747
    update_time_ms: 4.777
  timestamp: 1638398532
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 24000
  training_iteration: 6
  trial_id: '88823_00000'
  
Result for PPO_StatelessCartPole_88823_00000:
  agent_timesteps_total: 28000
  custom_metrics: {}
  date: 2021-12-01_23-42-32
  done: false
  episode_len_mean: 35.74107142857143
  episode_media: {}
  episode_reward_max: 111.0
  episode_reward_mean: 35.74107142857143
  episode_reward_min: 10.0
  episodes_this_iter: 112
  episodes_total: 1051
  experiment_id: 7e1494afa3414dd998e7cde489d370fd
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 4.999999873689376e-05
          entropy: 0.6521294116973877
          entropy_coeff: 0.0
          kl: 0.008087929338216782
          policy_loss: -0.001228039851412177
          total_loss: 156.5521697998047
          vf_explained_var: -0.018433474004268646
          vf_loss: 156.5497589111328
    num_agent_steps_sampled: 28000
    num_agent_steps_trained: 28000
    num_steps_sampled: 28000
    num_steps_trained: 28000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 7
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 79.4392857142857
    ram_util_percent: 84.67857142857142
  pid: 12464
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.1278596423707056
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.12899242071720216
    mean_inference_ms: 3.353225638041823
    mean_raw_obs_processing_ms: 0.2171522871719407
  time_since_restore: 151.05801963806152
  time_this_iter_s: 20.11628007888794
  time_total_s: 151.05801963806152
  timers:
    learn_throughput: 291.389
    learn_time_ms: 13727.331
    load_throughput: 9323635.44
    load_time_ms: 0.429
    sample_throughput: 202.114
    sample_time_ms: 19790.796
    update_time_ms: 4.81
  timestamp: 1638398552
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 28000
  training_iteration: 7
  trial_id: '88823_00000'
  
Result for PPO_StatelessCartPole_88823_00000:
  agent_timesteps_total: 32000
  custom_metrics: {}
  date: 2021-12-01_23-42-52
  done: false
  episode_len_mean: 28.52857142857143
  episode_media: {}
  episode_reward_max: 117.0
  episode_reward_mean: 28.52857142857143
  episode_reward_min: 8.0
  episodes_this_iter: 140
  episodes_total: 1191
  experiment_id: 7e1494afa3414dd998e7cde489d370fd
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 4.999999873689376e-05
          entropy: 0.6189269423484802
          entropy_coeff: 0.0
          kl: 0.010187552310526371
          policy_loss: -0.01204092800617218
          total_loss: 135.2486572265625
          vf_explained_var: -0.08293487131595612
          vf_loss: 135.25611877441406
    num_agent_steps_sampled: 32000
    num_agent_steps_trained: 32000
    num_steps_sampled: 32000
    num_steps_trained: 32000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 8
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 80.66551724137932
    ram_util_percent: 84.77931034482759
  pid: 12464
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.12512978011718637
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.12649312220671538
    mean_inference_ms: 3.3173461928775416
    mean_raw_obs_processing_ms: 0.21560295664752305
  time_since_restore: 171.38141465187073
  time_this_iter_s: 20.323395013809204
  time_total_s: 171.38141465187073
  timers:
    learn_throughput: 292.724
    learn_time_ms: 13664.765
    load_throughput: 7989150.476
    load_time_ms: 0.501
    sample_throughput: 202.015
    sample_time_ms: 19800.559
    update_time_ms: 4.709
  timestamp: 1638398572
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 32000
  training_iteration: 8
  trial_id: '88823_00000'
  
Result for PPO_StatelessCartPole_88823_00000:
  agent_timesteps_total: 36000
  custom_metrics: {}
  date: 2021-12-01_23-43-12
  done: false
  episode_len_mean: 35.785714285714285
  episode_media: {}
  episode_reward_max: 91.0
  episode_reward_mean: 35.785714285714285
  episode_reward_min: 10.0
  episodes_this_iter: 112
  episodes_total: 1303
  experiment_id: 7e1494afa3414dd998e7cde489d370fd
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 4.999999873689376e-05
          entropy: 0.616661548614502
          entropy_coeff: 0.0
          kl: 0.00952129065990448
          policy_loss: -0.00046885418123565614
          total_loss: 135.04832458496094
          vf_explained_var: -0.05452437326312065
          vf_loss: 135.0445098876953
    num_agent_steps_sampled: 36000
    num_agent_steps_trained: 36000
    num_steps_sampled: 36000
    num_steps_trained: 36000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 9
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 76.94814814814812
    ram_util_percent: 84.89629629629631
  pid: 12464
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.1238827989880333
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.12481765181822614
    mean_inference_ms: 3.2899851537455973
    mean_raw_obs_processing_ms: 0.21254970774456478
  time_since_restore: 190.8527319431305
  time_this_iter_s: 19.471317291259766
  time_total_s: 190.8527319431305
  timers:
    learn_throughput: 295.82
    learn_time_ms: 13521.73
    load_throughput: 8987794.286
    load_time_ms: 0.445
    sample_throughput: 201.373
    sample_time_ms: 19863.604
    update_time_ms: 4.629
  timestamp: 1638398592
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 36000
  training_iteration: 9
  trial_id: '88823_00000'
  
Result for PPO_StatelessCartPole_88823_00000:
  agent_timesteps_total: 40000
  custom_metrics: {}
  date: 2021-12-01_23-43-31
  done: true
  episode_len_mean: 35.06140350877193
  episode_media: {}
  episode_reward_max: 147.0
  episode_reward_mean: 35.06140350877193
  episode_reward_min: 10.0
  episodes_this_iter: 114
  episodes_total: 1417
  experiment_id: 7e1494afa3414dd998e7cde489d370fd
  hostname: nb-stschn
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 4.999999873689376e-05
          entropy: 0.610243558883667
          entropy_coeff: 0.0
          kl: 0.0024006376042962074
          policy_loss: -0.00499696284532547
          total_loss: 148.4215850830078
          vf_explained_var: -0.0558871254324913
          vf_loss: 148.42550659179688
    num_agent_steps_sampled: 40000
    num_agent_steps_trained: 40000
    num_steps_sampled: 40000
    num_steps_trained: 40000
    num_steps_trained_this_iter: 0
  iterations_since_restore: 10
  node_ip: 127.0.0.1
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 78.55714285714284
    ram_util_percent: 84.91785714285713
  pid: 12464
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.12214391459572835
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.12507529134994974
    mean_inference_ms: 3.262920007962691
    mean_raw_obs_processing_ms: 0.21362642628938813
  time_since_restore: 210.48571395874023
  time_this_iter_s: 19.63298201560974
  time_total_s: 210.48571395874023
  timers:
    learn_throughput: 298.029
    learn_time_ms: 13421.519
    load_throughput: 9986438.095
    load_time_ms: 0.401
    sample_throughput: 201.72
    sample_time_ms: 19829.44
    update_time_ms: 4.566
  timestamp: 1638398611
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 40000
  training_iteration: 10
  trial_id: '88823_00000'
  
Option 3c: Training finished successfully

== Status ==
Current time: 2021-12-01 23:43:32 (running for 00:04:07.82)
Memory usage on this node: 10.1/11.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/3 CPUs, 0/0 GPUs, 0.0/1.31 GiB heap, 0.0/0.65 GiB objects
Result logdir: C:\Users\Stefan\ray_results\PPO
Number of trials: 1/1 (1 TERMINATED)

Trial progress per training iteration:

Trial name	status	loc	iter	total time (s)	ts	reward	episode_reward_max	episode_reward_min	episode_len_mean
PPO_StatelessCartPole_88823_00000	RUNNING	127.0.0.1:12464	1	20.7648	4000	23.9394	76	10	23.9394
PPO_StatelessCartPole_88823_00000	RUNNING	127.0.0.1:12464	2	39.7072	8000	27.7917	122	9	27.7917
PPO_StatelessCartPole_88823_00000	RUNNING	127.0.0.1:12464	3	58.6449	12000	23.0171	66	8	23.0171
PPO_StatelessCartPole_88823_00000	RUNNING	127.0.0.1:12464	4	79.0444	16000	27.2789	65	10	27.2789
PPO_StatelessCartPole_88823_00000	RUNNING	127.0.0.1:12464	5	104.522	20000	23.4464	102	9	23.4464
PPO_StatelessCartPole_88823_00000	RUNNING	127.0.0.1:12464	6	130.942	24000	28.9643	94	9	28.9643
PPO_StatelessCartPole_88823_00000	RUNNING	127.0.0.1:12464	7	151.058	28000	35.7411	111	10	35.7411
PPO_StatelessCartPole_88823_00000	RUNNING	127.0.0.1:12464	8	171.381	32000	28.5286	117	8	28.5286
PPO_StatelessCartPole_88823_00000	RUNNING	127.0.0.1:12464	9	190.853	36000	35.7857	91	10	35.7857
PPO_StatelessCartPole_88823_00000	TERMINATED	127.0.0.1:12464	10	210.486	40000	35.0614	147	10	35.0614

print_reward(results3c)

Reward after 10 training iterations: 35.06140350877193
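
To put this final reward into perspective, it helps to look at how the mean episode reward developed over the ten training iterations. The following is a minimal sketch, assuming the per-iteration `reward` values are copied by hand from the status output above. The names `rewards_3c` and `moving_average` are illustrative and not part of the notebook's helper functions.

```python
# Mean episode reward per training iteration for Option 3c,
# copied by hand from the status output above (iterations 1 to 10).
rewards_3c = [23.9394, 27.7917, 23.0171, 27.2789, 23.4464,
              28.9643, 35.7411, 28.5286, 35.7857, 35.0614]


def moving_average(values, window=3):
    """Trailing moving average; early entries average over fewer values."""
    averaged = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


smoothed = moving_average(rewards_3c)
improvement = rewards_3c[-1] - rewards_3c[0]

print(f"First iteration: {rewards_3c[0]:.2f}, last iteration: {rewards_3c[-1]:.2f}")
print(f"Absolute improvement: {improvement:.2f}")
print(f"Smoothed reward at the end: {smoothed[-1]:.2f}")
```

The smoothing is only there because the raw values fluctuate noticeably between iterations. With ten iterations, a window of three is an arbitrary but reasonable choice.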

plot_rewards(results3c)

plot_learning(results1, label="1: Full Observations")
plot_learning(results2, label="2: Partial Observations")
plot_learning(results3a, label="3a: Stacked, Partial Observations")
plot_learning(results3b, label="3b: LSTM")
plot_learning(results3c, label="3c: Attention")

Attention with Stacked Observations

Important

This blog post is still work in progress. Currently, there seems to be an issue with attention in RLlib.

Stacking Observations Using Gym’s `FrameStack` Wrapper