
Conversation

@pjanx
Contributor

@pjanx pjanx commented Apr 3, 2021

Should fix #939 and fix #932 almost entirely, shifting the bottleneck for large models. Passes tests and memcheck.

Properties of this changeset:

  • memory complexity of IdLists increases by n * sizeof(int) bytes, which is about as low as you can go here
  • IdList::Add now relies on a single memmove of PODs, which is faster than std::move-ing individual elements
  • IdList::RemoveTagged uses a bit of additional temporary memory to keep its complexity
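For illustration, the Add scheme above might be sketched like this (a simplified sketch with plain ints standing in for elements; the names are hypothetical, not the PR's actual code):

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Sketch, not the actual SolveSpace code: `elems` grows at the end, while
// `indexes` is a permutation of positions kept sorted by ID, costing
// exactly n * sizeof(int) extra memory.
struct IntIdList {
    std::vector<int> elems;   // insertion order
    std::vector<int> indexes; // positions into elems, sorted by ID

    void Add(int id) {
        // Binary search for the insertion point in the sorted index.
        int lo = 0, hi = (int)indexes.size();
        while(lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if(elems[indexes[mid]] < id) lo = mid + 1; else hi = mid;
        }
        elems.push_back(id);
        indexes.push_back(0);
        // One memmove of POD ints shifts the tail of the index array.
        size_t tail = indexes.size() - 1 - lo;
        if(tail) memmove(&indexes[lo + 1], &indexes[lo], tail * sizeof(int));
        indexes[lo] = (int)elems.size() - 1;
    }
};
```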

Overall, I'm happy with this now, and I wouldn't be afraid of merging it. Commits should be reasonably atomic.

I think I inadvertently replicated half of Evil-Spirit's change. I haven't understood yet what he's doing there, though, even if it looks vaguely similar.

@pjanx
Contributor Author

pjanx commented Apr 3, 2021

I would appreciate it if someone tried to resolve the Windows build failure, I don't speak MSVC++ error messages.

Edit: seems like this issue.

@ruevs
Member

ruevs commented Apr 3, 2021

:-D everyone comes to more or less the same idea :-) I'll push mine tomorrow - just for comparison... But I haven't finished the iterator.

pjanx added 6 commits April 4, 2021 02:38
Maintain an array that points to IdList elements in the order
of their ID, not used for anything so far.
Use an iterator class internally to index elements using the array
established in the previous commit.

Remove a few unused methods.
Rather than keeping `elems` in ID order, add new elements to the end
and just replace removed elements with the last one.

The complexity is shifted to the `indexes` permutation array,
which is typically faster.

While using no more additional memory, RemoveTagged() is slightly
slowed down in this commit.
Allocate a bit of memory to make it O(n).
Resolves another performance regression of iteration being
O(n * log n) rather than O(n).
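The RemoveTagged idea from the commits above might look roughly like this (a hypothetical, simplified sketch, not the PR's actual code): tagged elements are compacted out of the storage array, and the sorted index is then rebuilt in one pass using a small temporary old-to-new position map, which avoids an O(n log n) re-sort.

```cpp
#include <cassert>
#include <vector>

struct Elem { int id; bool tagged; };

// Sketch: remove tagged elements in O(n) while keeping the ID-sorted
// `indexes` permutation valid, at the cost of one temporary array.
static void RemoveTagged(std::vector<Elem> &elems, std::vector<int> &indexes) {
    int n = (int)elems.size();
    std::vector<int> oldToNew(n, -1);
    int dest = 0;
    for(int src = 0; src < n; src++) {
        if(elems[src].tagged) continue;
        elems[dest] = elems[src];
        oldToNew[src] = dest++;
    }
    elems.resize(dest);
    // Walk the old sorted order once, keeping survivors; the result
    // is still sorted by ID.
    int out = 0;
    for(int i = 0; i < n; i++) {
        int np = oldToNew[indexes[i]];
        if(np >= 0) indexes[out++] = np;
    }
    indexes.resize(out);
}
```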
@ruevs
Member

ruevs commented Apr 4, 2021

Here is mine https://github.com/ruevs/solvespace/tree/FixIdList_TheNextAttempt working and tests passing.

Very similar to yours with some differences - I used vectors as containers and I have a freelist to reduce reallocations in some cases.

I do not have the last huge "Remove NextAfter" (3b395bb) commit in the branch, but it is critical for some workloads.

Without this commit it is almost as fast (or barely slower) on this https://github.com/solvespace/solvespace/files/5962704/case-vents.slvs.gz from here #932 (comment) compared to current master.

But it is drastically faster on setting the last translate to 20 copies in this https://github.com/solvespace/solvespace/files/5974674/LeliaManya2.zip. It takes about 2s while setting it to 6 copies takes 15s on master (I did not wait for 20 to finish)!

Next I will try yours on Windows with Visual Studio 2019.

@pjanx
Contributor Author

pjanx commented Apr 4, 2021

There should be a test for possible corner cases in RemoveTagged; I made a mistake there that passed the tests but crashed on models.

I see how the freelist simplifies the algorithm in that method (at the cost of some more memory).

Like you, I also thought about putting some assertions in the iterator but... if they weren't there before... maybe it's not important.

I tried to change as little as possible but NextAfter just has to go.

Doing benchmarks was on my list, though the improvement was so immense that I just declared victory.

The next thing that irritates me about SolveSpace is the omnipresent NURBS failures, and I can't really do the math. How much do you and @phkahler charge? :) Triangle meshes also take a long time to compute.

@phkahler
Member

phkahler commented Apr 4, 2021

The next thing that irritates me about SolveSpace is the omnipresent NURBS failures, and I can't really do the math. How much do you and @phkahler charge? :) Triangle meshes also take a long time to compute.

@pjanx There is a NURBS tag for issues. The failures have been getting better, and most are tracked in #738. It's an area of the code that is a bit outside most people's experience, but @ruevs and I have been chipping away at it for a while now. A few other people have had a look too, which has been very helpful even when we didn't take their changes directly.

Triangle meshes are IMHO not to be used except when NURBS fail. The code there has not had the optimizations or OMP love applied, so it could be slower now than doing the same with NURBS. When STL linking lands, I suppose we'll hear more reports of mesh bugs.

SolveSpace usage has been increasing as have the number of contributors, so there is hope ;-)

@pjanx
Contributor Author

pjanx commented Apr 4, 2021

Yeah, I'm having some seemingly randomized mesh issues with my most complex model already (it NURBS-fails very early). Luckily Slicer mostly takes care of it on import, or the issues aren't too severe in 3D print.

@phkahler
Member

phkahler commented Apr 4, 2021

@pjanx The Evil-Spirit patch replaces the list of things with a list of pointers to things. That made all the moving faster and probably reduced cache misses while searching.

@pjanx
Contributor Author

pjanx commented Apr 4, 2021

Replicating IDs in the index like he seems to be doing would eliminate some cache misses during search-by-ID, at the cost of even more memory. I'm not sure how much it is okay to waste on this; it'd be nice to have a memory profile. Binary search isn't particularly cache-friendly in any case, as it jumps around and prefetching fails to work optimally. In that sense, this general approach will always be slower during whole-list iteration than keeping the elements themselves in ID order. A lot of theorising here, besides the general observation that "large models are fairly snappy now".
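For concreteness, the search-by-ID path under discussion boils down to something like this (a sketch with plain ints standing in for IDs; each binary-search probe dereferences through the index into the element array, which is where the cache misses come from, and which replicating IDs in the index would avoid at a memory cost):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sketch: binary search by ID through the permutation array. The
// comparator chases indexes[k] into elems on every probe.
int FindById(const std::vector<int> &elems,
             const std::vector<int> &indexes, int id) {
    auto it = std::lower_bound(indexes.begin(), indexes.end(), id,
        [&](int pos, int h) { return elems[pos] < h; });
    if(it != indexes.end() && elems[*it] == id)
        return *it; // position in elems
    return -1;    // not found
}
```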

@ruevs
Member

ruevs commented Apr 4, 2021

There should be a test for possible corner cases in RemoveTagged; I made a mistake there that passed the tests but crashed on models.

Unfortunately the test suite is far, far from complete/exhaustive/achieving good code coverage, or being able to catch regressions :-( Still, it's better than the "nothing" we had before Whitequark wrote it :-)

This is one reason it would have been extremely hard to merge your last commit here when she was the maintainer. One reason SolveSpace is so stable is that she was (quite rightly) very careful about merging such all-encompassing changes that are hard to regression-test and also "dirty up" git blame, thus making regressions harder to track down. That said, I totally agree that NextAfter absolutely has to go - it simply makes it impossible to make IdList efficient for all cases.

I broke RemoveTagged as well until my last commit (the 0xDEADBEEF caught it ;-) )
elemidx.resize(n); // Clear left over elements at the end.

To debug the iterator I wrote a quick and dirty "unit test" of sorts for IdList with a simple (int, char) struct. But there is no good place to commit this.

@ruevs
Member

ruevs commented Apr 4, 2021

How much do you and @phkahler charge? :)

For me it's just for fun ;-) Not to mention that I don't really feel qualified when hacking on the NURBS backend; I forgot all the linear algebra and calculus so long ago that I constantly have to consult references :-)

But having another mind to bounce ideas off of is extremely helpful. Thank you for that. For example, your starting to work on the IdList: it has been in the back of my mind since the Whitequark commit and Evil-Spirit's patch (so at least four years), but you stirred things up, which led to this pull request and my similar one.

Another one is Eigen - search the issues and pull requests ;-)

I'll make a tracking issue for performance - just as I did for NURBS in #738 .

@ruevs ruevs mentioned this pull request Apr 4, 2021
@pjanx
Contributor Author

pjanx commented Apr 4, 2021

This is one reason it would have been extremely hard to merge your last commit here when she was the maintainer. One reason SolveSpace is so stable is that she was (quite rightly) very careful about merging such all-encompassing changes that are hard to regression-test and also "dirty up" git blame, thus making regressions harder to track down. That said, I totally agree that NextAfter absolutely has to go - it simply makes it impossible to make IdList efficient for all cases.

As far as my experience goes, 2.3 was much less stable, and a performance minefield that made me re-learn to save often. Converting models to master is a pain, but the software is usable. I'm not sure whom to blame for this newfound stability. :)

I don't really like the "git blame" argument, it often seems like a universal excuse to never change anything. Bisection should take care of tracking regressions.

What I've noticed is that the codebase is in the process of losing some sense of coherence, which this PR doesn't particularly help. And cleaning that up is certainly an offensive against "git blame".

I broke RemoveTagged as well until my last commit (the 0xDEADBEEF caught it ;-) )
elemidx.resize(n); // Clear left over elements at the end.

I just read the source code one more time and found it. I've spent quite a few hours trying to visualise that function and put it into code, mostly because I said I would do it. Sadly, I don't have any good methods for making things simpler.

@ruevs
Member

ruevs commented Apr 4, 2021

Converting models to master is a pain, but the software is usable. I'm not sure whom to blame for this newfound stability. :)

Whitequark ;-)

I don't really like the "git blame" argument, it often seems like a universal excuse to never change anything. Bisection should take care of tracking regressions.

In general I agree, but contingent upon available maintainer time vs. desire for stability.

Anyway, I'm pretty sure either this/yours or mine, or a cleaned up amalgamation of the two (they are pretty much the same) will get merged after 3.0.

(Both @phkahler and I have been known to merge "stupid" stuff in the past ;-) )

Member

@ruevs ruevs left a comment

OK so overall we did roughly the same. I'll steal your "remove NextAfter" commit and try mine with it tomorrow.

I think we should try to "converge" on a single solution. If you have time I would love to hear your comments/critique on my version.

int offset;

Iterator(IdList<T, H> &list, int offset) : list(&list), offset(offset) {}
T &operator*() { return list->elem[list->indexes[offset]]; }
Member

On mine I kept a local T *elem updated when the iterator is changed so that * and -> can be fast. They may be called many times per loop iteration, and ++ usually only once.
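A minimal sketch of that cached-pointer idea (hypothetical names, plain ints for elements; the cached pointer is refreshed on ++, so * is a single dereference in the loop body):

```cpp
#include <cassert>
#include <vector>

// Sketch: * and -> may run many times per loop iteration while ++
// typically runs once, so the element pointer is cached on advance.
struct CachedIter {
    const std::vector<int> *elems;
    const std::vector<int> *indexes;
    int offset;
    const int *cur; // cached so operator* is one dereference

    CachedIter(const std::vector<int> &e, const std::vector<int> &ix, int off)
        : elems(&e), indexes(&ix), offset(off) { Refresh(); }
    void Refresh() {
        cur = (offset < (int)indexes->size())
                  ? &(*elems)[(*indexes)[offset]] : nullptr;
    }
    const int &operator*() const { return *cur; }
    CachedIter &operator++() { ++offset; Refresh(); return *this; }
    bool operator!=(const CachedIter &o) const { return offset != o.offset; }
};
```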

Contributor Author

@pjanx pjanx Apr 4, 2021

That's one thing optimizers should be good at, unless some aliasing rules break it. I'd be somewhat surprised if it were measurable, given the added complexity. Making the offset point directly into the indexes array is more obvious. And both of these suffer if the array is resized during iteration, which the STL simply declares undefined behaviour in such situations.

Iterator(IdList<T, H> &list, int offset) : list(&list), offset(offset) {}
T &operator*() { return list->elem[list->indexes[offset]]; }
T *operator->() { return &list->elem[list->indexes[offset]]; }
const T &operator*() const { return list->elem[list->indexes[offset]]; }
Member

@ruevs ruevs Apr 4, 2021

We should carefully review the need for const methods and in which of the former NextAfter loops we can add const. Luckily this is just a lot of clicking and testing whether the compiler agrees ;-)

What do you think - should we remove the unused overloads in the end? I'm split between having a short iterator and a "complete" one with mostly unused methods. @phkahler ? @rpavlik ?

Contributor Author

MSVC required me to add const versions here, for std::lower_bound. See the first comment of the PR.

Contributor

I'm inclined to think we should just implement the code that's needed, on the YAGNI principle. Unused methods are untested and possible traps for the future.

Member

Removed all unused methods from mine.

Iterator &operator++() { ++offset; return *this; }
Iterator &operator+=(int diff) { offset += diff; return *this; }
int operator-(const Iterator &i) const { return offset - i.offset; }
bool operator==(const Iterator &i) { return list == i.list && offset == i.offset; }
Member

I should fix this - I'm not comparing the lists for equality after I removed the bounds checking in my last commit ruevs@6bc1aa9

Contributor Author

It should be an assertion rather than a direct test like this.

Member

@ruevs ruevs Apr 4, 2021

It was an assertion before I removed it ;-) Maybe I'll put them back in.

auto it = std::lower_bound(begin(), end(), h, Compare());
return it;
Iterator LowerBound(H const& h) {
return std::lower_bound(begin(), end(), h, Compare());
Member

@ruevs ruevs Apr 4, 2021

I opted to use iterators on the index and complicated the predicate on the theory that the vector<int> iterator is simpler than my IdList one and thus hoping it would be faster. But I have not profiled to confirm.

Contributor Author

I wouldn't expect a meaningful difference, it boils down to a few pointer operations. Didn't you have to add a forward declaration and complicate CompareId because of that?

Member

Yes. I called the predicate `CompareId`.

void ReserveMore(int howMuch) {
    if(n + howMuch > elemsAllocated) {
        elemsAllocated = n + howMuch;
        int *newIndexes = (int *)::operator new[]((size_t)elemsAllocated*sizeof(int));
Member

Using vector for the storage gets rid of the manual memory management, and hopefully at least in release mode should be as fast.

Contributor Author

That's one of the things that should be unified throughout the codebase, I'm just using what's already there.

I don't even know how the custom allocator library is hooked up.

@Evil-Spirit wanted to use realloc, a trick that std::vector should in principle be able to pull off as well, though I'm dubious whether it helps much given geometric growth steps.

Contributor

The important bit is that IdList has reference semantics, not value semantics, so you can't embed a std::vector directly, but rather a std::shared_ptr<std::vector<T>> - a lesson I learned painfully.
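The distinction can be illustrated with a toy example (hypothetical types, not SolveSpace code): copying an object that embeds a std::vector duplicates the storage, while embedding a shared_ptr to the vector keeps all copies aliasing the same data.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Value semantics: each copy owns its own storage.
struct ValueList { std::vector<int> v; };

// Reference semantics: copies alias the same underlying storage,
// which is what callers holding a list copy may be relying on.
struct RefList {
    std::shared_ptr<std::vector<int>> v =
        std::make_shared<std::vector<int>>();
};
```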

Member

@rpavlik Do you mean to say that once a pointer to an element is "taken out" of an IdList, it may be kept around and dereferenced after the list has been modified? And thus the pointer may be invalid?

If this is the case can you point to an example?

But this does not make sense - the original implementation of IdList did not guarantee that the elements will not move - in fact, it moved them around all the time!

In fact my implementation is "better" in this respect since it never moves elements once they are added because the elemstore is not rearranged when elements are removed:
https://github.com/ruevs/solvespace/blob/3c6e2be4b920d706d8231030467c5271a6d87d85/src/dsc.h#L570
the removed elements are simply added to the freelist.

// Find out where the added element should be.
// Shift indexes to the end of the array, if needed.
int pos = LowerBoundIndex(*t);
memmove(indexes + pos + 1, indexes + pos, sizeof *indexes * (n - pos));
Member

I wonder if insert in a vector<int> is as fast. I'll debug it tomorrow to see what it does.

Contributor Author

It should optimize that way.


T *begin() { return IsEmpty() ? nullptr : &elem[0]; }
T *end() { return IsEmpty() ? nullptr : &elem[0] + n; }
const T *begin() const { return IsEmpty() ? nullptr : &elem[0]; }
Member

As above - the const iterators may be useful for compiler optimisations, if we can add const in some places where IdLists are used.

Contributor Author

I'd be interested in seeing evidence of any difference in this case.

elem[dest] = elem[src];
}
dest++;
std::vector<int> elem2indexes(n);
Member

I need to trace through this with the debugger to really get it. Tomorrow.

@ruevs
Member

ruevs commented Apr 4, 2021

Converting models to master is a pain

By the way, why is this? Models from 2.0-2.3 should "just open™". If you have some that don't (and are not this #120) open an issue.

@pjanx
Contributor Author

pjanx commented Apr 4, 2021

Mostly getting rid of all those extra degrees of freedom, rotation takes a while to figure out. I don't remember much else.

Edit: and some constraint failures, so they had to be redone. I didn't keep the old models.

@ruevs
Member

ruevs commented Apr 5, 2021

@pjanx
Tried your version - it seems to work just as fast and as correctly as mine. RemoveTagged seems correct, but it's a bit complicated without a free element list.

I created a pull request with my version with a cleaned up history: #1002
It uses your "remove NextAfter" commit.

Next I will try to see if I can measure speed differences between our implementations, but I highly doubt that there will be any.
I will also profile to see if there are some more bottlenecks related to IdList.

After that it boils down to style/aesthetics etc.

@pjanx
Contributor Author

pjanx commented Apr 5, 2021

I think we need an arbiter. It'd be nice to have a memory profile as well, Valgrind can also do that, IIRC.

@Evil-Spirit
Collaborator

Evil-Spirit commented Apr 7, 2021

Hi there!

Things I remember:

  1. memmove and realloc are super fast
  2. if you are writing a container yourself, it is a bad idea to build it on top of other containers like std::vector: you can't control what happens under the hood. If you are afraid of manual memory management, remember that you write the container precisely so that the rest of the program doesn't have to manage memory manually; inside the container you do have to do it yourself, otherwise you lose all the benefits of doing this job
  3. when you need map-style lookup, you should use binary search. When you need to iterate over elements, you may iterate over the linear storage array only - welcome to the cache. But NOT ALL THE TIME: some operations have to be performed in sorted order, like saving the file. You can break the round-trip tests just by not iterating in the hv alphabetic order.

@baryluk

baryluk commented Apr 18, 2021

A bit of my input. With concerns from #932

I didn't check the memory usage, but did a bit of speed testing of these PRs: just dragging an unconstrained extrusion length with the mouse and watching for when the UI is fully interactive again. I used gcc 10.2.1 on Linux, amd64, with -O3 -flto -march=native. With no optimisations, these PRs are actually slower than current master, because of the extra small functions and indirections in the iterators. But with optimizations on, they are a tiny bit faster.

master, -O3 -flto: ~6.6s, 2.78s for Generate::DIRTY
ruevs, -O3 -flto: ~5.4s, 2.31s for Generate::DIRTY
pjanx, -O3 -flto: ~5.5s, 2.27s for Generate::DIRTY

That doesn't sound like a lot to me, and I don't think it is worth merging either of them in the current state. Don't get me wrong, this is good code and a move in the right direction, but it might require more profiling, tuning, and base data structure selection.

(No OpenMP used).

UPDATE:

LeilaManya2, changing last translate group from 3 to 20 repeats.

master: ~138s, 126119ms for Generate::DIRTY
pjanx: ~1.2s, 427ms for Generate::DIRTY
ruevs: ~1.1s, 403ms for Generate::DIRTY

Nice! Ok. I might be convinced.

@ruevs
Member

ruevs commented Apr 20, 2021

(quoting @baryluk's benchmark comment above in full)

I was just about to mention https://github.com/solvespace/solvespace/files/5974674/LeliaManya2.zip when I noticed you updated your comment ;-) Thanks for running it to 20 repeats on master - I did not have the patience :-)

Would you please do the same timing comparison when modifying some dimension in the first group of this model? #841 (comment)

How do you do the measurements on Linux?

@phkahler
Member

phkahler commented May 7, 2021

By the way, why is this? Models from 2.0-2.3 should "just open™". If you have some that don't (and are not this #120) open an issue.

Off topic, but I know some Lathe groups can have issues. Lathe used to create two copies of the sketch entities, like at the start and end of an extrude or revolve, even though both copies were exactly the same. I removed those redundant entities, so any sketch that constrained against them could break in 3.0 depending on which copy the constraint used.

@phkahler
Member

Going with the @ruevs version unless something bad happens.

Successfully merging this pull request may close these issues.

Poor Linking performance
Poor performance when extruding translate groups
