
Conversation


@sven1977 sven1977 commented Nov 26, 2020

The current attention net trajectory view PR (#11729) is too large (>1000 lines added).
Therefore, I'm moving smaller preparatory and cleanup changes into a few pre-PRs. This is the third of these; please only review it once the second one (#12449) has been merged.

Why are these changes needed?

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@sven1977 sven1977 changed the title [ONHOLD RLlib] Attention Net prep PR #3. [RLlib] Attention Net prep PR #3. Dec 1, 2020
@sven1977 sven1977 requested a review from ericl December 1, 2020 08:48
@ericl ericl left a comment


@sven1977 main question is the input dict option. Maybe I'm missing something, but it doesn't seem like a necessary change. Correct me if this is wrong.

# (abs_pos=-1). It's only used if the trajectory is not finished yet and we
# have to rely on the last value function output as a reward estimation.
return {
    "_value_input_dict": ViewRequirement(is_input_dict=True, abs_pos=-1)
Contributor:

Wondering if there are better names for this

# Can specify either abs_index or shift
abs_index=-1
shift=x

I think the notion of "index" is clearer in Python (-1 index means end). Also, we get to keep shift.

Contributor Author:

Fair enough, I'll rename these two:
abs_pos -> index
data_rel_pos -> shift
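
For reference, a minimal, self-contained sketch of the difference between the two addressing modes; ToyViewRequirement below is a toy stand-in for illustration only, not RLlib's actual ViewRequirement class:

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyViewRequirement:
    index: Optional[int] = None  # absolute index; -1 means "last timestep"
    shift: int = 0               # offset relative to the current timestep


def resolve(req: ToyViewRequirement, trajectory: List[float], t: int) -> float:
    """Return the value this requirement picks from `trajectory` at step `t`."""
    if req.index is not None:
        return trajectory[req.index]   # Python-style indexing: -1 == end
    return trajectory[t + req.shift]   # relative to the current timestep


rewards = [0.1, 0.2, 0.3, 0.4]
assert resolve(ToyViewRequirement(index=-1), rewards, t=1) == 0.4  # last step
assert resolve(ToyViewRequirement(shift=-1), rewards, t=2) == 0.2  # one step back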

Contributor Author:

On the input_dict:
It's not necessary for this very PR, but since this is a preparatory PR (to make the attention PR smaller), I decided to add it here already. The attention net PR needs this feature so we don't have to boilerplate/hardcode the attention logic inside e.g. PPO's postprocessing fn (that function shouldn't have to know whether the model is an RNN or an attention net).

object_manager_.FreeObjects(object_ids,
/*local_only=*/false);
},
on_objects_spilled),
Contributor:

? revert change

used_for_training (bool): Whether the data will be used for
training. If False, the column will not be copied into the
final train batch.
is_input_dict (bool): Whether the "view" of this requirement is an
Contributor:

This seems odd, what is the reason we need it? Are there any cleaner alternatives?

Contributor Author:

This option will be necessary for attention nets. We shouldn't have attention net or RNN-specific code in the postprocessing fn (e.g. of PPO). Instead, it's like saying: "I need an input_dict here, provide one given the model's requirements".
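
A hedged, self-contained sketch of that idea: the model alone declares which columns a single forward pass needs, and the postprocessing code consumes them without any RNN-/attention-specific branching. All names below are illustrative stand-ins, not RLlib's actual API.

class ToyModel:
    # Only the model knows which columns a single forward pass needs (an
    # attention net would also list its memory columns here).
    single_step_cols = ("obs", "prev_actions")

    def value_function(self, input_dict):
        # Toy value estimate for illustration.
        return float(input_dict["obs"])


def build_input_dict(sample_batch, cols, index=-1):
    # Assemble a single-timestep input dict from a trajectory (index=-1 ==
    # last timestep), driven purely by the model's declared needs.
    return {col: sample_batch[col][index] for col in cols}


def postprocess(model, sample_batch):
    # Model-agnostic: no "is this an RNN / attention net?" branching here.
    input_dict = build_input_dict(sample_batch, model.single_step_cols)
    return model.value_function(input_dict)


batch = {"obs": [1, 2, 3], "prev_actions": [0, 1, 1]}
print(postprocess(ToyModel(), batch))  # 3.0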

return self.model.value_function()[0]
# Input dict is provided to us automatically via the policy-defined
# "view". It's a single-timestep (last one in trajectory)
# input_dict.
Contributor:

What is the reason for this change? Could we get the previous code to work without adding this if branch?

Contributor Author:

see above: We want the input dict to be determined by what the model needs as inputs.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 1, 2020
use_critic=policy.config["use_critic"])
else:
batch = sample_batch
sample_batch = postprocess_ppo_gae(policy, sample_batch,
Contributor Author:

Re-use PPO's function.
We should probably do the same for A3C and PG.
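
A hedged sketch of that re-use pattern in another algorithm's policy; the import path is an assumption about the RLlib layout of that era:

from ray.rllib.agents.ppo.ppo_tf_policy import postprocess_ppo_gae


def postprocess_trajectory(policy, sample_batch,
                           other_agent_batches=None, episode=None):
    # Delegate the GAE/advantage computation to PPO's helper instead of
    # duplicating it; the same could be done for A3C and PG.
    return postprocess_ppo_gae(policy, sample_batch, other_agent_batches,
                               episode)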

sample_batch[SampleBatch.ACTIONS][-1],
sample_batch[SampleBatch.REWARDS][-1],
*next_state)
# Input dict is provided to us automatically via the Model's
Contributor Author:

Ask the Model for the input dict (from the given SampleBatch) at index=-1 (see the sketch below).

  • In prep for attention nets, which have special requirements for this (different from RNNs, different from non-recurrent models).
  • Removes boilerplate input-dict-creation code.
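
A hedged sketch of what this looks like for the bootstrap value at the end of an unfinished trajectory; `get_input_dict` mirrors the experimental Model method added in this PR, but the exact call and the `_value` helper are illustrative:

def compute_bootstrap_value(policy, sample_batch):
    if sample_batch["dones"][-1]:
        # Trajectory ended in a terminal state: nothing to bootstrap.
        return 0.0
    # The Model knows its own view requirements (plain obs, RNN state,
    # attention memory, ...), so this stays model-agnostic.
    input_dict = policy.model.get_input_dict(sample_batch, index=-1)
    return policy._value(**input_dict)  # hypothetical value helper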

Contributor:

Is the comment inaccurate now?

Contributor Author:

No, this is still valid.

self.fc2 = nn.Linear(self.rnn_hidden_dim, num_outputs)
self.n_agents = model_config["n_agents"]

self.inference_view_requirements.update({
Contributor Author:

make sure this stays backward-compatible even w/o specifying this here.
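
For context, a hedged sketch of what such an update typically declares for a recurrent model; the constructor arguments are assumptions about the ViewRequirement API of that era, and per the note above an equivalent requirement is auto-added even when the model omits it:

import numpy as np
from gym.spaces import Box
from ray.rllib.policy.view_requirement import ViewRequirement

rnn_hidden_dim = 64  # illustrative; corresponds to self.rnn_hidden_dim
state_view_reqs = {
    "state_in_0": ViewRequirement(
        data_col="state_out_0",  # fill from the previous step's state output
        shift=-1,
        space=Box(-1.0, 1.0, shape=(rnn_hidden_dim, ), dtype=np.float32)),
}
# Inside the model's __init__ this would be:
#   self.inference_view_requirements.update(state_view_reqs)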

})
self.buffers[SampleBatch.OBS].append(init_obs)
self.buffers[SampleBatch.EPS_ID].append(episode_id)
self.episode_id = episode_id
Contributor Author:

Don't have to "collect" these. They are always the same for the same agent anyway.


def build(self, view_requirements: Dict[str, ViewRequirement]) -> \
SampleBatch:
def build(self, view_requirements: Dict[str, ViewRequirement],
Contributor Author:

Maybe rename this to model_view_requirements for clarity.

Contributor Author:

Also: Model.inference_view_requirements -> Model.view_requirements.

# Python primitive or dict (e.g. INFOs).
if isinstance(data, (int, float, bool, str, dict)):
self.buffers[col] = [0 for _ in range(shift)]
self.buffers[col] = [data for _ in range(shift)]
Contributor Author:

Important for custom initial state values. Cannot assume always 0 here.
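
A tiny self-contained illustration of the point (names are illustrative):

def init_buffer(initial_value, shift):
    # Old behavior was `[0] * shift`, which silently breaks custom
    # get_initial_state() implementations that return non-zero values.
    return [initial_value for _ in range(shift)]


print(init_buffer(0.5, shift=3))  # [0.5, 0.5, 0.5], not [0, 0, 0]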

# not view_requirements[view_col].used_for_training:
# continue
self.buffers[view_col].extend(data)
# 1) If col is not in view_requirements, we must have a direct
Contributor Author:

I think this solves it (see the sketch below):

  • if - after postprocessing - some column is not in the view-reqs, we must be dealing with a direct child of the base Policy class (w/o auto-view-requirement handling) -> leave as-is.
  • if we do have it in the view-reqs AND used_for_training is False -> we must have gone through auto-detection, so it's safe to remove the column here (it won't be needed for training).
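
A hedged, self-contained sketch of that rule as it would apply when assembling the final train batch (names are illustrative):

from collections import namedtuple

Req = namedtuple("Req", "used_for_training")


def drop_unused_columns(batch, view_requirements):
    for col in list(batch.keys()):
        req = view_requirements.get(col)
        if req is None:
            # Not in the view-reqs: likely a plain base-Policy child without
            # auto view-requirement handling -> leave the column alone.
            continue
        if not req.used_for_training:
            # Auto-detected as not needed for training -> safe to drop.
            del batch[col]
    return batch


batch = {"obs": [1, 2], "infos": [{}, {}]}
reqs = {"obs": Req(True), "infos": Req(False)}
print(drop_unused_columns(batch, reqs))  # {'obs': [1, 2]}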

data_col: view_req.space.sample()
})
data_list.append(buffers[k][data_col][time_indices])
if data_col == SampleBatch.EPS_ID:
Contributor Author:

same as above: episode_id is always the same within one agent's collector. No need to collect an extra buffer here.

fake_sampler: bool = False,
spaces: Optional[Dict[PolicyID, Tuple[gym.spaces.Space,
gym.spaces.Space]]] = None,
_use_trajectory_view_api: bool = True,
Contributor Author:

This is now passed explicitly into RolloutWorker (it was derived from policy_config before, which is problematic because that could be a partial config dict).

# inherited from base `Policy` class. At this point here, the Policy
# must have it's Model (if any) defined and ready to output an initial
# state.
for pol in self.policy_map.values():
Contributor Author:

Do the automatic internal-state -> view-requirement setup here, after the policy has been created. This covers direct children of the base Policy class, which don't have an auto-view-requirement mechanism.

return self.time_major is True

# TODO: (sven) Experimental method.
def get_input_dict(self, sample_batch,
Contributor Author:

The Model is able to create an input_dict for a single-step forward pass from an agent's trajectory batch.

action_distribution=action_dist,
timestep=timestep,
explore=explore)
if self.config["_use_trajectory_view_api"]:
Contributor Author:

Not needed.

mo = re.match("state_in_(\d+)", view_col)
if mo is not None:
input_dict[view_col] = self._state_inputs[int(mo.group(1))]
dummy_batch[view_col] = np.zeros_like(
Contributor Author:

Better do all these in one call below.

batch_for_postproc = UsageTrackingDict(sb)
batch_for_postproc.count = sb.count
logger.info("Testing `postprocess_trajectory` w/ dummy batch.")
self.exploration.postprocess_trajectory(
Contributor Author:

Have to also call exploration's postprocessing (may access fields in the batch we need to track; e.g. curiosity).

# Just like torch Policy does.
self._optimizer = optimizers[0] if optimizers else None

self._initialize_loss_from_dummy_batch(
Contributor Author:

Moved this here for consistency (same behavior as TorchPolicy). Also fixes a problem with curiosity where we do need the optimizer before loss init.


def _update_model_inference_view_requirements_from_init_state(self):
"""Uses this Model's initial state to auto-add necessary ViewReqs.
"""Uses Model's (or this Policy's) init state to add needed ViewReqs.
Contributor Author:

Make this more robust against Policies that don't have a model, but do return something from get_initial_state().
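
A hedged sketch of that fallback (names are illustrative):

def initial_state_for(policy):
    # Prefer the Model's initial state if the Policy has a model; otherwise
    # fall back to the Policy's own get_initial_state(), so plain Policy
    # subclasses are covered too. Returns [] for stateless policies/models.
    source = getattr(policy, "model", None) or policy
    return source.get_initial_state()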

from ray.rllib.policy.policy import Policy, LEARNER_STATS_KEY
from ray.rllib.policy.sample_batch import SampleBatch
from ray.rllib.policy.torch_policy import TorchPolicy
from ray.rllib.policy.view_requirement import ViewRequirement
Contributor Author:

Not needed anymore in any Policy templates.

infos={},
new_obs=obs_batch[0])
batch = builder.build_and_reset(episode=None)
env_id = episodes[0].env_id
Contributor Author:

Changed test case to use new SampleCollector.

@sven1977 sven1977 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 4, 2020

view-col to data-col in them).
inference_view_requirements (Dict[str, ViewRequirement]: The view
requirements dict needed to build an input dict for a ModelV2
forward call.
Contributor:

This argument doesn't seem to be used, can we remove it?

Contributor Author:

done.


ericl commented Dec 6, 2020

Looks good, but please resolve comments before merging.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 6, 2020

sven1977 commented Dec 6, 2020

👍 Will do.

@sven1977 sven1977 added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Dec 7, 2020
@sven1977 sven1977 merged commit 99c81c6 into ray-project:master Dec 7, 2020
@sven1977 sven1977 deleted the attention_nets_prep_3 branch March 27, 2021 11:39
