-
Notifications
You must be signed in to change notification settings - Fork 49
Prevent excessive trajectory slicing #978
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent excessive trajectory slicing #978
Conversation
Codecov Report
@@ Coverage Diff @@
## master #978 +/- ##
==========================================
+ Coverage 80.25% 80.27% +0.01%
==========================================
Files 136 136
Lines 14452 14466 +14
==========================================
+ Hits 11599 11613 +14
Misses 2853 2853
Continue to review full report at Codecov.
|
|
@dwhswenson this is ready for a review. As you probably know this code better than I do, are there any other places where a similar strategy might be beneficial? (looking at it quickly, I can probably mimic this pretty easy for |
dwhswenson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do also modify _find_subtraj_first. That should help get rid of some more UUID creation.
What you have here works, but I think there's a much easier way (as opposed to the exttraj iterator, etc.) Wouldn't this do the job:
# subtraj = traj[slice(start, stop)]
# replace with
subtraj = list(traj)[slice(start, stop)]I think all we need is an iterable of snapshots, so this would create that with very little change to the code.
# subtraj = traj[slice(start, stop)]
# replace with
subtraj = list(traj)[slice(start, stop)]
I did do that in 1c15643 , this might possibly break custom ensembles, but the workaround is generally easy and the limited change/robustness in this code is worth it in my opinion. As ensembles are not fully covered under coverage, please do have a really good look at this PR. Here is a new set of numbers from this PR:
Currently this PR prevents 99.9% of all created trajectories in 10 mc steps And these number seem reasonable for 44 ensembles and 10 mc steps |
|
@dwhswenson please have another look at this |
dwhswenson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A general change that you will want to do everywhere, instead of list(trajectory). Discussed in more detail in a comment, but now that I think about it, it applies to the whole changeset:
You should always try to use one of the proxy methods from Trajectory.
get_as_proxy(idx): get a single snapshot, allowing proxy snapshotsas_proxies(): return a list of all snapshots, which may include proxy snapshotsiter_proxies(): iterator over snaphots, allowing proxies
When you call list(trajectory), it will call Trajectory.__iter__(), which will fully reload the snapshot into memory. This will be slow and isn't needed for a lot of analysis. (I had forgotten about this when I suggested converting to list before; only remembered when I saw changes dropping the as_proxies.)
this might possibly break custom ensembles
I'm not sure where this risk would be. I suppose there might be something weird if you subclass SequentialEnsemble in some strange way. In general, we discourage subclassing SequentialEnsemble (unless you're changing the frame assignment algorithm). Instead, custom ensembles should be subclasses of WrappedEnsemble, wrapping the appropriate SequentialEnsemble. This is what we do for TISEnsemble, MinusInterfaceEnsemble, and VisitAllStatesEnsemble, as a few examples.
The only thing I was worried about with this change was handling proxy snapshots, which you are dealing with.
As ensembles are not fully covered under coverage, please do have a really good look at this PR.
Although ensemble.py doesn't have full coverage, the sections modified here (EnsembleCache and SequentialEnsemble) are pretty close, mainly missing coverage of logging stuff in debug guards and of a few error cases. Additionally, they're tested very thoroughly, involving what is probably thousands of individual asserts. SequentialEnsemble was some very complex code to write, and I didn't trust it until I had written every test I could think of at the time.
|
@dwhswenson implemented the I updated the table in #978 (comment) with the new data. (still sub |
dwhswenson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good. A small optional change in a comment, as well as the following thought:
It might be worth just having a function called _get_list_traj which does this getattr/list pair of operations. I suspect CodeClimate will complain about repeated code here, and since the reason for the need is obscure enough, it might be worth it to have a single location with a clearer docstring explaining what's going on. That's optional though.
In case you want to make those changes, I will leave this open for at least 24 hours, merging no earlier than Wed 24 Feb 16:00 GMT (17:00 local). If no comment/update by then, I will merge sometime after that.
Yeah, looking back at it that is the cleaner and more maintainable solution, will refactor that now. |
|
@dwhswenson refactored the code, it still results in numbers in the same order as 1cff417 . Please have a look if you think the docstring is descriptive enough. By the way, thank you for the (multiple) in-depth reviews. |
LGTM. If anything, it's probably more detailed than needed!
No problem, and thank you for making the improvements! Merging this one now. |
LGTM. If anything, it's probably more detailed than needed!
No problem, and thank you for making the improvements! Merging this one now. |
Some more UUID optimization coming from #892.
So while running the test uuid system from #892, I noticed a lot of
Trajectoryobjects getting generated. Looking at the code, every time you slice a Trajectory, you generate a new Trajectory.To test how many and were I added the following code to trajectory.py (and where)
and ran the following script:
(important part here is that I only look at the number of trajectories created after initialization)
Here is a table of number of trajectories created from running this 6 times on
masterand 6 times on this branchThis PR saves 70% of all Trajectory objects generated (during sampling).
23_000still an insane amount in my opinion for just10mc steps, but this is at least a decent step into (hopefully) not running into #892 as often.In the long term; sliced trajectories should probably return read_only views (without an uuid) and not new Trajectories