Skip to content

Conversation

@sroet
Copy link
Member

@sroet sroet commented Feb 19, 2021

Some more UUID optimization coming from #892.

So while running the test uuid system from #892, I noticed a lot of Trajectory objects getting generated. Looking at the code, every time you slice a Trajectory, you generate a new Trajectory.

To test how many and were I added the following code to trajectory.py (and where)

+++ b/openpathsampling/engines/trajectory.py
@@ -23,6 +23,7 @@ class Trajectory(list, StorableObject):
     """
 
     engine = None
+    kill_count = 1000000
 
     def __init__(self, trajectory=None):
         """
@@ -45,6 +46,10 @@ class Trajectory(list, StorableObject):
             else:
                 self.extend(trajectory)
 
+        Trajectory.kill_count -= 1
+        if Trajectory.kill_count < 0:
+            raise RuntimeError("KILLED")
+

and ran the following script:

sset = testscheme.initial_conditions_from_trajectories([eng_trj, eng_trj2,
                                                        eng_trj3],
                                                        strategies=[
                                                            'split',
                                                            'extend-complex',
                                                            'extend-minimal'])
# Error out early
sset.sanity_check()
testscheme.assert_initial_conditions(sset)

#Equilibrate
equilibration = paths.PathSampling(
    storage=None,
    sample_set=sset,
    move_scheme=testscheme
)
initial = 100000
paths.Trajectory.kill_count=initial
equilibration.run(10)
print("trajs_generated = "+str(initial-paths.Trajectory.kill_count))

(important part here is that I only look at the number of trajectories created after initialization)

Here is a table of number of trajectories created from running this 6 times on master and 6 times on this branch

main this PR
77857 20728
64553 22852
73042 14508
76004 18205
75349 27016
82440 33964
mean 74874 22879

This PR saves 70% of all Trajectory objects generated (during sampling). 23_000 still an insane amount in my opinion for just 10 mc steps, but this is at least a decent step into (hopefully) not running into #892 as often.

In the long term; sliced trajectories should probably return read_only views (without an uuid) and not new Trajectories

@codecov
Copy link

codecov bot commented Feb 19, 2021

Codecov Report

Merging #978 (7bf903c) into master (3f2c0c4) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #978      +/-   ##
==========================================
+ Coverage   80.25%   80.27%   +0.01%     
==========================================
  Files         136      136              
  Lines       14452    14466      +14     
==========================================
+ Hits        11599    11613      +14     
  Misses       2853     2853              
Impacted Files Coverage Δ
openpathsampling/ensemble.py 84.74% <100.00%> (+0.19%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3f2c0c4...7bf903c. Read the comment docs.

@sroet
Copy link
Member Author

sroet commented Feb 19, 2021

@dwhswenson this is ready for a review. As you probably know this code better than I do, are there any other places where a similar strategy might be beneficial? (looking at it quickly, I can probably mimic this pretty easy for _find_subtraj_first, but that function has not popped up yet when I try to kill on Trajectory generation)

Copy link
Member

@dwhswenson dwhswenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do also modify _find_subtraj_first. That should help get rid of some more UUID creation.

What you have here works, but I think there's a much easier way (as opposed to the exttraj iterator, etc.) Wouldn't this do the job:

# subtraj = traj[slice(start, stop)]
# replace with
subtraj = list(traj)[slice(start, stop)]

I think all we need is an iterable of snapshots, so this would create that with very little change to the code.

@sroet
Copy link
Member Author

sroet commented Feb 21, 2021

What you have here works, but I think there's a much easier way (as opposed to the exttraj iterator, etc.) Wouldn't this do the job:

# subtraj = traj[slice(start, stop)]
# replace with
subtraj = list(traj)[slice(start, stop)]

I think all we need is an iterable of snapshots, so this would create that with very little change to the code.

I did do that in 1c15643 , this might possibly break custom ensembles, but the workaround is generally easy and the limited change/robustness in this code is worth it in my opinion.

As ensembles are not fully covered under coverage, please do have a really good look at this PR. Here is a new set of numbers from this PR:

master 1c15643 b89bce9 1cff417
71186 5850 62 45
75610 3257 114 60
76504 7526 86 88
67516 6842 99 113
71031 4789 47 72
67648 9378 33 99
mean 71186 6274 74 80

Currently this PR prevents 99.9% of all created trajectories in 10 mc steps And these number seem reasonable for 44 ensembles and 10 mc steps

@sroet
Copy link
Member Author

sroet commented Feb 21, 2021

@dwhswenson please have another look at this

@sroet sroet requested a review from dwhswenson February 21, 2021 17:50
Copy link
Member

@dwhswenson dwhswenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general change that you will want to do everywhere, instead of list(trajectory). Discussed in more detail in a comment, but now that I think about it, it applies to the whole changeset:

You should always try to use one of the proxy methods from Trajectory.

  • get_as_proxy(idx): get a single snapshot, allowing proxy snapshots
  • as_proxies(): return a list of all snapshots, which may include proxy snapshots
  • iter_proxies(): iterator over snaphots, allowing proxies

When you call list(trajectory), it will call Trajectory.__iter__(), which will fully reload the snapshot into memory. This will be slow and isn't needed for a lot of analysis. (I had forgotten about this when I suggested converting to list before; only remembered when I saw changes dropping the as_proxies.)

this might possibly break custom ensembles

I'm not sure where this risk would be. I suppose there might be something weird if you subclass SequentialEnsemble in some strange way. In general, we discourage subclassing SequentialEnsemble (unless you're changing the frame assignment algorithm). Instead, custom ensembles should be subclasses of WrappedEnsemble, wrapping the appropriate SequentialEnsemble. This is what we do for TISEnsemble, MinusInterfaceEnsemble, and VisitAllStatesEnsemble, as a few examples.

The only thing I was worried about with this change was handling proxy snapshots, which you are dealing with.

As ensembles are not fully covered under coverage, please do have a really good look at this PR.

Although ensemble.py doesn't have full coverage, the sections modified here (EnsembleCache and SequentialEnsemble) are pretty close, mainly missing coverage of logging stuff in debug guards and of a few error cases. Additionally, they're tested very thoroughly, involving what is probably thousands of individual asserts. SequentialEnsemble was some very complex code to write, and I didn't trust it until I had written every test I could think of at the time.

@dwhswenson dwhswenson mentioned this pull request Feb 22, 2021
@sroet
Copy link
Member Author

sroet commented Feb 23, 2021

@dwhswenson implemented the proxy methods as you advised, please have another look.

I updated the table in #978 (comment) with the new data. (still sub 10 Trajectories per md step)

@sroet sroet requested a review from dwhswenson February 23, 2021 11:26
Copy link
Member

@dwhswenson dwhswenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good. A small optional change in a comment, as well as the following thought:

It might be worth just having a function called _get_list_traj which does this getattr/list pair of operations. I suspect CodeClimate will complain about repeated code here, and since the reason for the need is obscure enough, it might be worth it to have a single location with a clearer docstring explaining what's going on. That's optional though.

In case you want to make those changes, I will leave this open for at least 24 hours, merging no earlier than Wed 24 Feb 16:00 GMT (17:00 local). If no comment/update by then, I will merge sometime after that.

@sroet
Copy link
Member Author

sroet commented Feb 23, 2021

It might be worth just having a function called _get_list_traj which does this getattr/list pair of operations. I suspect CodeClimate will complain about repeated code here, and since the reason for the need is obscure enough, it might be worth it to have a single location with a clearer docstring explaining what's going on. That's optional though.

Yeah, looking back at it that is the cleaner and more maintainable solution, will refactor that now.

@sroet
Copy link
Member Author

sroet commented Feb 23, 2021

@dwhswenson refactored the code, it still results in numbers in the same order as 1cff417 . Please have a look if you think the docstring is descriptive enough.

By the way, thank you for the (multiple) in-depth reviews.

@dwhswenson
Copy link
Member

Please have a look if you think the docstring is descriptive enough.

LGTM. If anything, it's probably more detailed than needed!

By the way, thank you for the (multiple) in-depth reviews.

No problem, and thank you for making the improvements!

Merging this one now.

@dwhswenson dwhswenson merged commit 7d562d5 into openpathsampling:master Feb 24, 2021
@dwhswenson
Copy link
Member

Please have a look if you think the docstring is descriptive enough.

LGTM. If anything, it's probably more detailed than needed!

By the way, thank you for the (multiple) in-depth reviews.

No problem, and thank you for making the improvements!

Merging this one now.

@sroet sroet deleted the prevent_excessive_trajectory_slicing branch February 24, 2021 16:11
@dwhswenson dwhswenson mentioned this pull request Jul 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants