Skip to content

Conversation

@dwhswenson
Copy link
Member

@dwhswenson dwhswenson commented May 15, 2021

This PR introduces analysis filters, an idea I've had for a while but just finally figured out how to implement correctly.

Basically, the idea is that a lot of OPS analysis code looks something like:

for step in steps:
    mover = step.change.canonical.mover
    if not step.change.accepted and isinstance(mover, my_mover_type):
        trials = paths.SampleSet(step.change.canonical.trials)
        if my_ensemble in trials.ensembles:
            sample = trials[my_ensemble]
            # do something with that sample

Here the hypothetical user is checking rejected trials of a specific mover and a specific ensemble. It's some messy code already a few levels of indent in, and with all that step.change stuff they need to remember. (And this is even using the hack that you can get a sample for a specific ensemble by converting the list of samples to a SampleSet.)

[EDIT: I updated the above to actually reflect the same behavior as below -- it's a little more complicated]

Here's how it works with the filters I'm introducing here:

filt = SampleFilter(
    step_filter=RejectedSteps & CanonicalMover(my_mover_type),
    sample_selector=TrialSamples,
    sample_filter=Ensemble(my_ensemble),
)
for sample in filt(steps, flatten=True):
    # do something with that sample

I guess it doesn't save too much vertical space, but it does save some characters, and I think it will be much easier to learn/teach this approach.

More usage in the notebook added in this PR. For easy use cases, it's really nice (for step in RejectedSteps(steps))


For discussion: does it make more sense stylistically to be:

RejectedSteps & CanonicalMover(my_mover_type)

or

rejected_steps & canonical_mover(my_mover_type)

Formally, I think PEP8 would advise

REJECTED_STEPS & CanonicalMover(my_mover_type)

since rejected states is actually a constant instance, while canonical mover really is a class. But consistency in user experience with filters is more important than foolish consistency with PEP8.

@codecov
Copy link

codecov bot commented May 15, 2021

Codecov Report

Base: 81.75% // Head: 81.65% // Decreases project coverage by -0.10% ⚠️

Coverage data is based on head (96495a3) compared to base (02102f3).
Patch coverage: 73.73% of modified lines in pull request are covered.

❗ Current head 96495a3 differs from pull request most recent head 999bea2. Consider uploading reports for the commit 999bea2 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1026      +/-   ##
==========================================
- Coverage   81.75%   81.65%   -0.11%     
==========================================
  Files         142      143       +1     
  Lines       15612    15810     +198     
==========================================
+ Hits        12764    12910     +146     
- Misses       2848     2900      +52     
Impacted Files Coverage Δ
openpathsampling/analysis/filters.py 73.73% <73.73%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@dwhswenson
Copy link
Member Author

dwhswenson commented May 15, 2021

I started to play with using the __matmul__ special method to create a simple syntax. I'd appreciate feedback on whether this syntax is understandable or if it is too obscure. The filter described above would be defined as:

filt = TrialSamples @ (RejectedSteps & CanonicalMover(my_mover_type), Ensemble(my_ensemble))

Eventually I'd like to remove the tuple and make it so a user can use & where the comma is, but that would require dealing with the distributive property of boolean algebra, which I'm not terribly interested in implementing right now. We'd probably continue to support this older notation, since it's easy to check whether the argument is a 2-tuple (and the implementation is trivial).

A few simpler examples of this syntax:

TrialSamples @ (RejectedSteps, None)  # all rejected samples
ActiveSamples @ (None, Ensemble(my_ensemble))  # samples repeated with correct weight for this ensemble
TrialSamples @ (RejectedSteps, Ensemble(my_ensemble))  # rejected samples for this ensemble

EDIT: Not liking the None above, the following syntaxes will also be available. Would appreciate feedback on the choice of using -- where seemed like another option:

TrialSamples @ (RejectedSteps, AllSamples)  # all rejected samples
TrialSamples.using(RejectedSteps)  # all rejected samples
TrialSamples.using(step_filter=RejectedSteps)  # all rejected samples

ActiveSamples @ (AllSteps, Ensemble(my_ensemble))  # samples repeated with correct weight for this ensemble
ActiveSamples.using(sample_filter=Ensemble(my_ensemble))  # samples repeated with correct weight for this ensemble

TrialSamples.using(RejectedSteps, Ensemble(my_ensemble))  # rejected samples for this ensemble

EDIT: What those simpler examples would look like with the target syntax (probably not near future, though):

TrialSamples @ RejectedSteps  # all rejected samples
ActiveSamples @ Ensemble(my_ensemble)  # samples repeated with correct weight for this ensemble
TrialSamples @ (RejectedSteps & Ensemble(my_ensemble))  # rejected samples for this ensemble

Parentheses are necessary for logical combinations because @ has higher precedence than &, |, etc..

@sroet
Copy link
Member

sroet commented May 17, 2021

For discussion: does it make more sense stylistically to be:

RejectedSteps & CanonicalMover(my_mover_type)

Definetly go for the double CamelCase ( you can even make RejectedSteps a class that just selects the rejected steps (with subclasses for the different types of rejections)

About the __matmul__ override, I think that that might be a bit obscure to be user facing. But it might be a welcome addition inside core OPS. This is mainly due to the strict tuple order for the filters, which could probably lead to strange user errors if done incorrectly.

One question about the output order. This seems to always return an outer loop over the steps with an inner loop over the ensembles, is there any ways for a user to invert those two loops with this syntax?

@dwhswenson
Copy link
Member Author

dwhswenson commented May 17, 2021

Thanks for the feedback!

you can even make RejectedSteps a class that just selects the rejected steps (with subclasses for the different types of rejections)

I originally had it as a class, but I didn't like the parentheses after RejectedSteps(). Actually, I really like the way this is currently implemented:

RejectedSteps = StepFilterCondition(lambda step: not step.change.accepted,
                                    name="RejectedSteps")

(where the name is required for error strings; I may make that optional) This means you can easily create any weird custom filter. For example, let's say you wanted to select all the steps where cv(shooting_point) < 0.3:

def my_condition(step):
    # did I mention that I added an extractor called `ShootingPoints`?
    return cv(ShootingPoints(step)) < 0.3

CustomFilter = StepFilterCondition(my_condition, name="Custom")

You might need to use that filter as ShootingSteps & CustomFilter, but still -- I think that's pretty accessible even to novice users.

About the __matmul__ override, I think that that might be a bit obscure to be user facing. But it might be a welcome addition inside core OPS. This is mainly due to the strict tuple order for the filters, which could probably lead to strange user errors if done incorrectly.

Fair enough. Any thoughts on the using (or possibly where) syntax? Example:

TrialSamples.using(RejectedSteps, Ensemble(my_ensemble))

This still requires both arguments, but with this you have all the standard introspection into Python functions that IDEs use for help. I'm starting to think this should be the preferred syntax -- I think it reads pretty well. Internally, it's just a wrapper around creating the SampleFilter.

One question about the output order. This seems to always return an outer loop over the steps with an inner loop over the ensembles, is there any ways for a user to invert those two loops with this syntax?

No, and I'm not sure that's even possible. The inner loop is over Samples; the samples come from inside the step, and so aren't defined until you have picked a step.

I'm guessing you're thinking about something like standard TIS analysis, where you want to group trajectories by the ensemble? We have the steps_to_weighted_trajectories method for that. I could add a group_by_ensemble method to return a dict mapping each ensemble to list of samples (probably list here, not generator, because of user confusion when the generator is exahausted):

# samples with unique accepted trajectories, sorted by ensemble
# assuming all trajectory generation is in shooting steps (no shifting, etc.)
unique_traj_samples = TrialSamples.using(AcceptedSteps & ShootingSteps)
group_by_ensemble(unique_traj_samples(steps))

Did you have a use case in mind other than a group-by operation?

@sroet
Copy link
Member

sroet commented May 17, 2021

Fair enough. Any thoughts on the using (or possibly where) syntax?

With the python introspection, this seems like a good idea. I have a slight preference for using as it does not remind me of np.where (which requires explicit conditions).

Did you have a use case in mind other than a group-by operation?

Nope, the group-by was the one I was thinking about.

@dwhswenson
Copy link
Member Author

I've been thinking more about the nested levels of filtering, and it seems that my thinking was a little off before. First, some terminology I've been using (would love to know if there's more official terms for these things):

  • condition filter: AcceptedSteps, RejectedSteps, Replica, etc are condition filters, i.e., filters that are based on some condition. AcceptedSteps is a step condition filter, Replica is a sample condition filter.
  • extractor: TrialSamples and ActiveSamples are extractors. Extractors extract some piece of information from a MCStep object.
  • extractor-filter: SampleFilter (or anything made by extractor.using()) is an extractor-filter. An extractor-filter combines an extractor and a filter (surprising, right?).

When I started on this, the extractors I was working on all returned lists of samples, so it made sense to add a second layer of filtering for the samples. But for other extractors, it doesn't make sense: ShootingPoints doesn't return samples -- you might want a step condition filter called TrialReplica or TrialEnsemble, which checks if any of the trials involved the given replica/ensemble. However, the filters to extract the right sample from a list of samples aren't relevant to ShootingPoints, because there's no list of samples.

So for a general extractor (ShootingPoints will probably be a good example), it will be:

extractor.using(step_filter)

For extractors that return lists of objects, here are the possible syntaxes:

# (A) current implementation
extractor.using(step_filter, secondary_filter)
TrialSamples.using(RejectedSteps, Replica(0))

# (B) Add a `with_filter` method to extractor-filters that sets the secondary filter
extractor.using(step_filter).with_filter(secondary_filter)
TrialSamples.using(RejectedSteps).with_filter(Replica(0))

In (B) I called it with_filter instead of just filter as a reminder that this is actually setting an attribute of the object being created.

Any thoughts? I think I'm leaning toward (B) since that should make it possible to make only one class for ExtractorFilter. Another nice thing about (B) is that I'd feel more comfortable making steps an optional parameter to using:

# allows this
for step in TrialSamples.using(RejectedSteps, steps=steps):
    ...

# instead of
for step in TrialSamples.using(RejectedSteps)(steps):  # the )( often confuses new users
    ...

# or
custom_filter = TrialSamples.using(RejectedSteps)  # gets rid of )( but adds line
for step in custom_filter(steps):
    ...

# it even allows
for step in TrialSamples.using(steps=steps):  # default filter is AllSteps
    ...

@sroet
Copy link
Member

sroet commented May 18, 2021

I've been thinking more about the nested levels of filtering, and it seems that my thinking was a little off before. First, some terminology I've been using (would love to know if there's more official terms for these things):

condition filter: AcceptedSteps, RejectedSteps, Replica, etc are condition filters, i.e., filters that are based on some condition. AcceptedSteps is a step condition filter, Replica is a sample condition filter.
extractor: TrialSamples and ActiveSamples are extractors. Extractors extract some piece of information from a MCStep object.
extractor-filter: SampleFilter (or anything made by extractor.using()) is an extractor-filter. An extractor-filter combines an extractor and a filter (surprising, right?).

So the quickest somewhat similar reference is from pandas.Dataframes.
If you mimic their case, then your condition filter is a localisation based on a boolean mask on the condition of certain columns inside certain rows. While your extractor would be filtering based on a single column/row name.

Now, this is just my personal impression as someone who likes to think of this data as a DataFrame with a row for every MCStep, which can be pivoted arbitrarily (like a pandas.DataFrame).

I personally have no preference between A and B, but as you describe it I do think B would lead to cleaner user facing code

@dwhswenson
Copy link
Member Author

I think of it as more of filter and apply, in the pandas sense, because technically it can apply any function to the step. But I think that, to users, "apply" might seem like an odd term here. So I guess I’ll stick with extract!

I also just noticed that the syntax using with_filter (which I've now implemented) reminds me a bit of the syntax for list comprehensions. Consider:

# to get the ensemble for a replica at each MC step
follow_replica = ActiveSamples.using(AllSteps).with_filter(Replica(0))
history = [sample.ensemble for sample in follow_replica(storage.steps)]

# write things in the order of:
# what you want, what to iterate over, the optional if-statement to filter

@dwhswenson
Copy link
Member Author

@sroet (and @gyorgy-hantal or anyone else who might be interested) : Before I write tests for this stuff, could you take a look at the documentation notebook and give any feedback on this implementation?

I am thinking I'll probably convert these to snake_case, instead of CamelCase. Reasoning is by comparison with custom implementations. Why is it list(AcceptedSteps(steps)) in cell 6 but list(custom_filter(steps)) in cell 11? Why ShootingPoints.using(AllSteps) but timing.using(AllSteps) in cell 21? But custom_filter and timing definitely feel correct.

One point I'd like advice on: We have "extractors," which extract some piece of information from the step, and "step filters", which select certain steps based on some property. For the most part, there isn't much namespace collision (RejectedSteps is a step filter; ShootingPoints is an extractor). However, the canonical mover is something that you might either want to extract or filter over. I think filtering is more common, so my leaning is to use CanonicalMover as the name for the step filter. But what, then, should the extractor be called? (See the notebook under "Predefined objects" for names of other extractors and filters.)

I'm thinking about importing all of these into the openpathsampling.analysis namespace, rather than requiring something like from openpathsampling.analysis import filters. I'm open to other suggestions about namespace organization here.

@sroet
Copy link
Member

sroet commented Jun 16, 2021

Before I write tests for this stuff, could you take a look at the documentation notebook and give any feedback on this implementation?

Seems fine by me.

But custom_filter and timing definitely feel correct.

I do agree that that feels correct, so +1 for snake_case

But what, then, should the extractor be called?

Is there a reason we could not just switch behaviour based on call input type? Is there any overlap? (a call with a string returns a filter and any other should return as an extractor)

I'm open to other suggestions about namespace organization here.

I am in favor for importing on top namespace, but do have split files for filter and extractor (just so one could do auto-completion on openpathsampling.analysis.filters or openpathsampling.analysis.extractors if you want a quick list for available filters/extractors)

@dwhswenson dwhswenson added this to the 1.6 milestone Jun 18, 2021
@dwhswenson
Copy link
Member Author

Is there a reason we could not just switch behaviour based on call input type? Is there any overlap? (a call with a string returns a filter and any other should return as an extractor)

While I think that's theoretically possible, I think it might be really confusing. It isn't as simple as a factory that returns different objects depending on the setup:

# as a filter
filt1 = canonical_mover('shooting')
shooting_steps = list(filt1(storage.steps))
filt2 = canonical_mover(scheme['shooting'][0])
shooter0_steps = list(filt2(storage.steps))

# as an extractor
mover = canonical_mover(step)
ef = canonical_mover.using(rejected_steps)
n_rejected_per_mover = collections.Counter(ef(storage.steps))

Really getting this to work would require either weird flags setting the type of object it was acting as, or some metaclass work.

So far, all Extractors are plural. This makes sense when used with using to get an extract-filter, although it's a little weird in cases like this (or even shooting_points). So one possibility is to use that as a distinction:

sp = shooting_points(step)  # already doing this
mover = canonical_movers(step)
ef = canonical_movers.using(rejected_steps)
...

It risks subtle errors (most likely incorrectly using canonical_mover(step) -- using filter in place of extractor). However, since the filter's __init__ already needs to handle some type checking, it's easy enough to add a very clear error message if it receives an MCStep.

I am in favor for importing on top namespace, but do have split files for filter and extractor (just so one could do auto-completion on openpathsampling.analysis.filters or openpathsampling.analysis.extractors if you want a quick list for available filters/extractors)

I was thinking the same more for coding purposes (shorter files; clearer where to find each thing). I hadn't considered the autocompletion, but that trick gives another good reason to do that split!

@sroet
Copy link
Member

sroet commented Jul 6, 2021

So far, all Extractors are plural. This makes sense when used with using to get an extract-filter, although it's a little weird in cases like this (or even shooting_points). So one possibility is to use that as a distinction:

sp = shooting_points(step)  # already doing this
mover = canonical_movers(step)
ef = canonical_movers.using(rejected_steps)
...

It risks subtle errors (most likely incorrectly using canonical_mover(step) -- using filter in place of extractor). However, since the filter's init already needs to handle some type checking, it's easy enough to add a very clear error message if it receives an MCStep.

This probably the most reasonable solution.
Please make sure this convention of Extractors being plural makes it somewhere in the official documentation.

@dwhswenson dwhswenson mentioned this pull request Jul 6, 2021
@dwhswenson
Copy link
Member Author

Please make sure this convention of Extractors being plural makes it somewhere in the official documentation.

Started on docs for this (in d17a04a). There will be a prominent note discussing this at the beginning of the section on extractors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants