[WIP] Analysis Filters #1026

dwhswenson · 2021-05-15T17:34:14Z

This PR introduces analysis filters, an idea I've had for a while but just finally figured out how to implement correctly.

Basically, the idea is that a lot of OPS analysis code looks something like:

for step in steps:
    mover = step.change.canonical.mover
    if not step.change.accepted and isinstance(mover, my_mover_type):
        trials = paths.SampleSet(step.change.canonical.trials)
        if my_ensemble in trials.ensembles:
            sample = trials[my_ensemble]
            # do something with that sample

Here the hypothetical user is checking rejected trials of a specific mover and a specific ensemble. It's some messy code already a few levels of indent in, and with all that step.change stuff they need to remember. (And this is even using the hack that you can get a sample for a specific ensemble by converting the list of samples to a SampleSet.)

[EDIT: I updated the above to actually reflect the same behavior as below -- it's a little more complicated]

Here's how it works with the filters I'm introducing here:

filt = SampleFilter(
    step_filter=RejectedSteps & CanonicalMover(my_mover_type),
    sample_selector=TrialSamples,
    sample_filter=Ensemble(my_ensemble),
)
for sample in filt(steps, flatten=True):
    # do something with that sample

I guess it doesn't save too much vertical space, but it does save some characters, and I think it will be much easier to learn/teach this approach.

More usage in the notebook added in this PR. For easy use cases, it's really nice (for step in RejectedSteps(steps))

For discussion: does it make more sense stylistically to be:

RejectedSteps & CanonicalMover(my_mover_type)

or

rejected_steps & canonical_mover(my_mover_type)

Formally, I think PEP8 would advise

REJECTED_STEPS & CanonicalMover(my_mover_type)

since rejected states is actually a constant instance, while canonical mover really is a class. But consistency in user experience with filters is more important than foolish consistency with PEP8.

codecov · 2021-05-15T19:09:51Z

Codecov Report

Base: 81.75% // Head: 81.65% // Decreases project coverage by -0.10% ⚠️

Coverage data is based on head (96495a3) compared to base (02102f3).
Patch coverage: 73.73% of modified lines in pull request are covered.

❗ Current head 96495a3 differs from pull request most recent head 999bea2. Consider uploading reports for the commit 999bea2 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1026      +/-   ##
==========================================
- Coverage   81.75%   81.65%   -0.11%     
==========================================
  Files         142      143       +1     
  Lines       15612    15810     +198     
==========================================
+ Hits        12764    12910     +146     
- Misses       2848     2900      +52

Impacted Files	Coverage Δ
openpathsampling/analysis/filters.py	`73.73% <73.73%> (ø)`

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

dwhswenson · 2021-05-15T20:20:01Z

I started to play with using the __matmul__ special method to create a simple syntax. I'd appreciate feedback on whether this syntax is understandable or if it is too obscure. The filter described above would be defined as:

filt = TrialSamples @ (RejectedSteps & CanonicalMover(my_mover_type), Ensemble(my_ensemble))

Eventually I'd like to remove the tuple and make it so a user can use & where the comma is, but that would require dealing with the distributive property of boolean algebra, which I'm not terribly interested in implementing right now. We'd probably continue to support this older notation, since it's easy to check whether the argument is a 2-tuple (and the implementation is trivial).

A few simpler examples of this syntax:

TrialSamples @ (RejectedSteps, None)  # all rejected samples
ActiveSamples @ (None, Ensemble(my_ensemble))  # samples repeated with correct weight for this ensemble
TrialSamples @ (RejectedSteps, Ensemble(my_ensemble))  # rejected samples for this ensemble

EDIT: Not liking the None above, the following syntaxes will also be available. Would appreciate feedback on the choice of using -- where seemed like another option:

TrialSamples @ (RejectedSteps, AllSamples)  # all rejected samples
TrialSamples.using(RejectedSteps)  # all rejected samples
TrialSamples.using(step_filter=RejectedSteps)  # all rejected samples

ActiveSamples @ (AllSteps, Ensemble(my_ensemble))  # samples repeated with correct weight for this ensemble
ActiveSamples.using(sample_filter=Ensemble(my_ensemble))  # samples repeated with correct weight for this ensemble

TrialSamples.using(RejectedSteps, Ensemble(my_ensemble))  # rejected samples for this ensemble

EDIT: What those simpler examples would look like with the target syntax (probably not near future, though):

TrialSamples @ RejectedSteps  # all rejected samples
ActiveSamples @ Ensemble(my_ensemble)  # samples repeated with correct weight for this ensemble
TrialSamples @ (RejectedSteps & Ensemble(my_ensemble))  # rejected samples for this ensemble

Parentheses are necessary for logical combinations because @ has higher precedence than &, |, etc..

sroet · 2021-05-17T09:18:49Z

For discussion: does it make more sense stylistically to be:

RejectedSteps & CanonicalMover(my_mover_type)

Definetly go for the double CamelCase ( you can even make RejectedSteps a class that just selects the rejected steps (with subclasses for the different types of rejections)

About the __matmul__ override, I think that that might be a bit obscure to be user facing. But it might be a welcome addition inside core OPS. This is mainly due to the strict tuple order for the filters, which could probably lead to strange user errors if done incorrectly.

One question about the output order. This seems to always return an outer loop over the steps with an inner loop over the ensembles, is there any ways for a user to invert those two loops with this syntax?

dwhswenson · 2021-05-17T12:50:46Z

Thanks for the feedback!

you can even make RejectedSteps a class that just selects the rejected steps (with subclasses for the different types of rejections)

I originally had it as a class, but I didn't like the parentheses after RejectedSteps(). Actually, I really like the way this is currently implemented:

RejectedSteps = StepFilterCondition(lambda step: not step.change.accepted,
                                    name="RejectedSteps")

(where the name is required for error strings; I may make that optional) This means you can easily create any weird custom filter. For example, let's say you wanted to select all the steps where cv(shooting_point) < 0.3:

def my_condition(step):
    # did I mention that I added an extractor called `ShootingPoints`?
    return cv(ShootingPoints(step)) < 0.3

CustomFilter = StepFilterCondition(my_condition, name="Custom")

You might need to use that filter as ShootingSteps & CustomFilter, but still -- I think that's pretty accessible even to novice users.

About the __matmul__ override, I think that that might be a bit obscure to be user facing. But it might be a welcome addition inside core OPS. This is mainly due to the strict tuple order for the filters, which could probably lead to strange user errors if done incorrectly.

Fair enough. Any thoughts on the using (or possibly where) syntax? Example:

TrialSamples.using(RejectedSteps, Ensemble(my_ensemble))

This still requires both arguments, but with this you have all the standard introspection into Python functions that IDEs use for help. I'm starting to think this should be the preferred syntax -- I think it reads pretty well. Internally, it's just a wrapper around creating the SampleFilter.

One question about the output order. This seems to always return an outer loop over the steps with an inner loop over the ensembles, is there any ways for a user to invert those two loops with this syntax?

No, and I'm not sure that's even possible. The inner loop is over Samples; the samples come from inside the step, and so aren't defined until you have picked a step.

I'm guessing you're thinking about something like standard TIS analysis, where you want to group trajectories by the ensemble? We have the steps_to_weighted_trajectories method for that. I could add a group_by_ensemble method to return a dict mapping each ensemble to list of samples (probably list here, not generator, because of user confusion when the generator is exahausted):

# samples with unique accepted trajectories, sorted by ensemble
# assuming all trajectory generation is in shooting steps (no shifting, etc.)
unique_traj_samples = TrialSamples.using(AcceptedSteps & ShootingSteps)
group_by_ensemble(unique_traj_samples(steps))

Did you have a use case in mind other than a group-by operation?

sroet · 2021-05-17T14:54:27Z

Fair enough. Any thoughts on the using (or possibly where) syntax?

With the python introspection, this seems like a good idea. I have a slight preference for using as it does not remind me of np.where (which requires explicit conditions).

Did you have a use case in mind other than a group-by operation?

Nope, the group-by was the one I was thinking about.

dwhswenson · 2021-05-18T12:47:55Z

I've been thinking more about the nested levels of filtering, and it seems that my thinking was a little off before. First, some terminology I've been using (would love to know if there's more official terms for these things):

condition filter: AcceptedSteps, RejectedSteps, Replica, etc are condition filters, i.e., filters that are based on some condition. AcceptedSteps is a step condition filter, Replica is a sample condition filter.
extractor: TrialSamples and ActiveSamples are extractors. Extractors extract some piece of information from a MCStep object.
extractor-filter: SampleFilter (or anything made by extractor.using()) is an extractor-filter. An extractor-filter combines an extractor and a filter (surprising, right?).

When I started on this, the extractors I was working on all returned lists of samples, so it made sense to add a second layer of filtering for the samples. But for other extractors, it doesn't make sense: ShootingPoints doesn't return samples -- you might want a step condition filter called TrialReplica or TrialEnsemble, which checks if any of the trials involved the given replica/ensemble. However, the filters to extract the right sample from a list of samples aren't relevant to ShootingPoints, because there's no list of samples.

So for a general extractor (ShootingPoints will probably be a good example), it will be:

extractor.using(step_filter)

For extractors that return lists of objects, here are the possible syntaxes:

# (A) current implementation
extractor.using(step_filter, secondary_filter)
TrialSamples.using(RejectedSteps, Replica(0))

# (B) Add a `with_filter` method to extractor-filters that sets the secondary filter
extractor.using(step_filter).with_filter(secondary_filter)
TrialSamples.using(RejectedSteps).with_filter(Replica(0))

In (B) I called it with_filter instead of just filter as a reminder that this is actually setting an attribute of the object being created.

Any thoughts? I think I'm leaning toward (B) since that should make it possible to make only one class for ExtractorFilter. Another nice thing about (B) is that I'd feel more comfortable making steps an optional parameter to using:

# allows this
for step in TrialSamples.using(RejectedSteps, steps=steps):
    ...

# instead of
for step in TrialSamples.using(RejectedSteps)(steps):  # the )( often confuses new users
    ...

# or
custom_filter = TrialSamples.using(RejectedSteps)  # gets rid of )( but adds line
for step in custom_filter(steps):
    ...

# it even allows
for step in TrialSamples.using(steps=steps):  # default filter is AllSteps
    ...

sroet · 2021-05-18T13:51:32Z

I've been thinking more about the nested levels of filtering, and it seems that my thinking was a little off before. First, some terminology I've been using (would love to know if there's more official terms for these things):

condition filter: AcceptedSteps, RejectedSteps, Replica, etc are condition filters, i.e., filters that are based on some condition. AcceptedSteps is a step condition filter, Replica is a sample condition filter.
extractor: TrialSamples and ActiveSamples are extractors. Extractors extract some piece of information from a MCStep object.
extractor-filter: SampleFilter (or anything made by extractor.using()) is an extractor-filter. An extractor-filter combines an extractor and a filter (surprising, right?).

So the quickest somewhat similar reference is from pandas.Dataframes.
If you mimic their case, then your condition filter is a localisation based on a boolean mask on the condition of certain columns inside certain rows. While your extractor would be filtering based on a single column/row name.

Now, this is just my personal impression as someone who likes to think of this data as a DataFrame with a row for every MCStep, which can be pivoted arbitrarily (like a pandas.DataFrame).

I personally have no preference between A and B, but as you describe it I do think B would lead to cleaner user facing code

dwhswenson · 2021-05-18T16:25:41Z

I think of it as more of filter and apply, in the pandas sense, because technically it can apply any function to the step. But I think that, to users, "apply" might seem like an odd term here. So I guess I’ll stick with extract!

I also just noticed that the syntax using with_filter (which I've now implemented) reminds me a bit of the syntax for list comprehensions. Consider:

# to get the ensemble for a replica at each MC step
follow_replica = ActiveSamples.using(AllSteps).with_filter(Replica(0))
history = [sample.ensemble for sample in follow_replica(storage.steps)]

# write things in the order of:
# what you want, what to iterate over, the optional if-statement to filter

dwhswenson · 2021-06-15T20:36:28Z

@sroet (and @gyorgy-hantal or anyone else who might be interested) : Before I write tests for this stuff, could you take a look at the documentation notebook and give any feedback on this implementation?

I am thinking I'll probably convert these to snake_case, instead of CamelCase. Reasoning is by comparison with custom implementations. Why is it list(AcceptedSteps(steps)) in cell 6 but list(custom_filter(steps)) in cell 11? Why ShootingPoints.using(AllSteps) but timing.using(AllSteps) in cell 21? But custom_filter and timing definitely feel correct.

One point I'd like advice on: We have "extractors," which extract some piece of information from the step, and "step filters", which select certain steps based on some property. For the most part, there isn't much namespace collision (RejectedSteps is a step filter; ShootingPoints is an extractor). However, the canonical mover is something that you might either want to extract or filter over. I think filtering is more common, so my leaning is to use CanonicalMover as the name for the step filter. But what, then, should the extractor be called? (See the notebook under "Predefined objects" for names of other extractors and filters.)

I'm thinking about importing all of these into the openpathsampling.analysis namespace, rather than requiring something like from openpathsampling.analysis import filters. I'm open to other suggestions about namespace organization here.

sroet · 2021-06-16T15:32:26Z

Before I write tests for this stuff, could you take a look at the documentation notebook and give any feedback on this implementation?

Seems fine by me.

But custom_filter and timing definitely feel correct.

I do agree that that feels correct, so +1 for snake_case

But what, then, should the extractor be called?

Is there a reason we could not just switch behaviour based on call input type? Is there any overlap? (a call with a string returns a filter and any other should return as an extractor)

I'm open to other suggestions about namespace organization here.

I am in favor for importing on top namespace, but do have split files for filter and extractor (just so one could do auto-completion on openpathsampling.analysis.filters or openpathsampling.analysis.extractors if you want a quick list for available filters/extractors)

dwhswenson · 2021-07-05T21:47:01Z

Is there a reason we could not just switch behaviour based on call input type? Is there any overlap? (a call with a string returns a filter and any other should return as an extractor)

While I think that's theoretically possible, I think it might be really confusing. It isn't as simple as a factory that returns different objects depending on the setup:

# as a filter
filt1 = canonical_mover('shooting')
shooting_steps = list(filt1(storage.steps))
filt2 = canonical_mover(scheme['shooting'][0])
shooter0_steps = list(filt2(storage.steps))

# as an extractor
mover = canonical_mover(step)
ef = canonical_mover.using(rejected_steps)
n_rejected_per_mover = collections.Counter(ef(storage.steps))

Really getting this to work would require either weird flags setting the type of object it was acting as, or some metaclass work.

So far, all Extractors are plural. This makes sense when used with using to get an extract-filter, although it's a little weird in cases like this (or even shooting_points). So one possibility is to use that as a distinction:

sp = shooting_points(step)  # already doing this
mover = canonical_movers(step)
ef = canonical_movers.using(rejected_steps)
...

It risks subtle errors (most likely incorrectly using canonical_mover(step) -- using filter in place of extractor). However, since the filter's __init__ already needs to handle some type checking, it's easy enough to add a very clear error message if it receives an MCStep.

I am in favor for importing on top namespace, but do have split files for filter and extractor (just so one could do auto-completion on openpathsampling.analysis.filters or openpathsampling.analysis.extractors if you want a quick list for available filters/extractors)

I was thinking the same more for coding purposes (shorter files; clearer where to find each thing). I hadn't considered the autocompletion, but that trick gives another good reason to do that split!

sroet · 2021-07-06T10:37:07Z

So far, all Extractors are plural. This makes sense when used with using to get an extract-filter, although it's a little weird in cases like this (or even shooting_points). So one possibility is to use that as a distinction:
sp = shooting_points(step)  # already doing this
mover = canonical_movers(step)
ef = canonical_movers.using(rejected_steps)
...
It risks subtle errors (most likely incorrectly using canonical_mover(step) -- using filter in place of extractor). However, since the filter's init already needs to handle some type checking, it's easy enough to add a very clear error message if it receives an MCStep.

This probably the most reasonable solution.
Please make sure this convention of Extractors being plural makes it somewhere in the official documentation.

dwhswenson · 2021-07-06T13:53:23Z

Please make sure this convention of Extractors being plural makes it somewhere in the official documentation.

Started on docs for this (in d17a04a). There will be a prominent note discussing this at the beginning of the section on extractors.

… into analysis_filters

... well that was easy

… into analysis_filters

Start to analysis filters

9144729

dwhswenson added the feature label May 15, 2021

Add shooting points; misc updates to filters

61d2570

updates to analysis filters

0b8aa53

Add start to tests

9527d4b

dwhswenson added this to the 1.6 milestone Jun 18, 2021

dwhswenson added 2 commits July 5, 2021 10:10

cleanup of STAGES; several new filters

3ce4a88

snake_case

b91cb2e

dwhswenson mentioned this pull request Jul 6, 2021

OpenPathSampling 1.5 #1031

Merged

dwhswenson added 2 commits July 6, 2021 09:50

add canonical_movers extractor; cleanup

766d81f

Add start to docs on extractors and filters

d17a04a

dwhswenson added 6 commits July 6, 2021 12:47

get rid of f-strings

8ab5a21

outline tests we need to do

31df2f2

Merge branch 'master' of github.com:openpathsampling/openpathsampling…

0ef0885

… into analysis_filters

start on test fixtures

68ab949

Merge remote-tracking branch 'upstream/master' into analysis_filters

a9456ea

split off testing structure (separate PR)

9a3cb08

dwhswenson mentioned this pull request Sep 10, 2021

New analysis testing utilities #1068

Merged

dwhswenson added 16 commits September 12, 2021 13:39

Merge branch 'test-analysis-utils' into tmp_test_analysis_filters

d8bd07e

tests for step filter

376b876

... well that was easy

steps toward testing

4108ecd

Merge branch 'master' of github.com:openpathsampling/openpathsampling…

6ecd22b

… into analysis_filters

Merge branch 'tmp_test_analysis_filters' into analysis_filters

cefb597

update test_filters for better analysis test utils

0fca990

fix Py2, more tests

298f73b

tests for bad input to canonical_mover

ec7cb4b

test_extractor

8aaa92a

Merge remote-tracking branch 'upstream/master' into analysis_filters

96495a3

steps toward better docs

20d12f8

starting on tests for shooting analysis

80bffc4

Parts of testing for filters

a6c9cf5

steps toward tests

a31b017

Merge remote-tracking branch 'upstream/master' into analysis_filters

a2bd212

Merge remote-tracking branch 'upstream/master' into analysis_filters

999bea2

[WIP] Analysis Filters #1026

Are you sure you want to change the base?

[WIP] Analysis Filters #1026

Uh oh!

Conversation

dwhswenson commented May 15, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented May 15, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

dwhswenson commented May 15, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sroet commented May 17, 2021

Uh oh!

dwhswenson commented May 17, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sroet commented May 17, 2021

Uh oh!

dwhswenson commented May 18, 2021

Uh oh!

sroet commented May 18, 2021

Uh oh!

dwhswenson commented May 18, 2021

Uh oh!

dwhswenson commented Jun 15, 2021

Uh oh!

sroet commented Jun 16, 2021

Uh oh!

dwhswenson commented Jul 5, 2021

Uh oh!

sroet commented Jul 6, 2021

Uh oh!

dwhswenson commented Jul 6, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dwhswenson commented May 15, 2021 •

edited

Loading

codecov bot commented May 15, 2021 •

edited

Loading

dwhswenson commented May 15, 2021 •

edited

Loading

dwhswenson commented May 17, 2021 •

edited

Loading