-
Notifications
You must be signed in to change notification settings - Fork 848
Improve Status performance with many ignored files #1379
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@silkeh thanks for proposing this PR. Although this may improve the use case of multiple ignored files, it seems to cause a regression when that is not the case. I did a quick benchmark testing a call to I am struggling for time to pinpoint optimisation areas in the PR, would you be able to optimise it for the standard use case as well? |
|
@pjbgf: Thanks for taking a look! I'll take a look later this week and see what I can improve. |
|
@onee-only: That was the plan, but I've been busy with other things. I'll try to make some time next week. |
Fix loading of `.gitignore` files in nested ignored directories. These were not matched in the current implementation, as that only checks the ignores of the current directory. As a side-effect this change also leads to significantly improved performance: in a repository with 3M ignored files `Status` now takes 160 s instead of 491 s.
25c3930 to
269fa79
Compare
Skip ignored files when walking through the worktree. This signigifantly improves the performance of `Status()`: In a repository with 3M ignored files `Status` now takes 5 s instead of 160 s.
269fa79 to
dcfb50d
Compare
|
I finally had some time to take a look and rebase the PR. The difference compared to v5.16.2 is a lot faster, but this PR in its current form doesn't hurt performance too much. If you test with a repo on disk it actually seems to be a bit faster. Benchmark I used: func BenchmarkStatus(b *testing.B) {
r, err := Clone(memory.NewStorage(), memfs.New(), &CloneOptions{
URL: "https://github.com/go-git/go-git",
})
require.NoError(b, err)
workTree, err := r.Worktree()
require.NoError(b, err)
_, err = workTree.Status()
require.NoError(b, err)
b.ReportAllocs()
b.ResetTimer()
for i := 0; i < b.N; i++ {
workTree.Status()
}
}If I read the profiling correctly most of the performance degradation is introduced in
I did a tiny optimization to reduce the allocations a bit, but the rest of the performance improvements would probably have to be done in the |
onee-only
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @silkeh! I'm looking forward to this being merged. I've left my opinions on the change. I hope these help to improve the code.
| // ReadPatterns reads the .git/info/exclude and then the gitignore patterns | ||
| // recursively traversing through the directory structure. The result is in | ||
| // the ascending order of priority (last higher). | ||
| func ReadPatterns(fs billy.Filesystem, path []string) (ps []Pattern, err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| func ReadPatterns(fs billy.Filesystem, path []string) (ps []Pattern, err error) { | |
| func ReadPatterns(fs billy.Filesystem, path string) (ps []Pattern, err error) { |
Since exported function is now separated, can we make path to single string? Without reading it's internal behavior, I first thought it represents specific directories to read ignore patterns, not path parts to be joined.
| noder.Noder | ||
|
|
||
| matcher Matcher | ||
| invert bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please elaborate what invert does? It seems that the value of invert is always true.
| type MatchNoder struct { | ||
| noder.Noder | ||
|
|
||
| matcher Matcher | ||
| invert bool | ||
| path []string | ||
| children []noder.Noder | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
9727b9b Introduced a Option struct in utils/merkletrie/filesystem. What do you think of adding gitignore.Matcher as an option in there? I believe we can achieve higher cohension with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing that and making IgnoreNoder interface could be a good idea for decoupling the implementation from the DiffTree().
| func (n *MatchNoder) findChild(node noder.Noder, name string) (noder.Noder, bool) { | ||
| children, _ := node.Children() | ||
|
|
||
| for _, child := range children { | ||
| if child.Name() == name { | ||
| return child, true | ||
| } | ||
| } | ||
|
|
||
| return nil, false | ||
| } | ||
|
|
||
| func (n *MatchNoder) noderPaths(path noder.Path) []string { | ||
| parts := make([]string, len(path)) | ||
|
|
||
| for i, p := range path { | ||
| parts[i] = p.Name() | ||
| } | ||
|
|
||
| return parts | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| func (n *MatchNoder) findChild(node noder.Noder, name string) (noder.Noder, bool) { | |
| children, _ := node.Children() | |
| for _, child := range children { | |
| if child.Name() == name { | |
| return child, true | |
| } | |
| } | |
| return nil, false | |
| } | |
| func (n *MatchNoder) noderPaths(path noder.Path) []string { | |
| parts := make([]string, len(path)) | |
| for i, p := range path { | |
| parts[i] = p.Name() | |
| } | |
| return parts | |
| } | |
| func findChild(node noder.Noder, name string) (noder.Noder, bool) { | |
| children, _ := node.Children() | |
| for _, child := range children { | |
| if child.Name() == name { | |
| return child, true | |
| } | |
| } | |
| return nil, false | |
| } | |
| func convertToStrings(path noder.Path) []string { | |
| parts := make([]string, len(path)) | |
| for i, p := range path { | |
| parts[i] = p.Name() | |
| } | |
| return parts | |
| } |
Since these functions aren't restricted to the use case of MatchNoder, we can move the functions out of the struct.
|
|
||
| func isIgnoredNode(tree noder.Noder, path noder.Path) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| func isIgnoredNode(tree noder.Noder, path noder.Path) bool { | |
| // isIgnoredPath reports if the path is ignored in the tree. | |
| func isIgnoredPath(tree noder.Noder, path noder.Path) bool { | |
| in, ok := tree.(*gitignore.MatchNoder) | |
| return ok && in.PathIgnored(path) | |
| } |
Maybe this is more intuitive?
|
|
||
| func ignoredNode(tree noder.Noder, path noder.Path) (noder.Path, bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| func ignoredNode(tree noder.Noder, path noder.Path) (noder.Path, bool) { | |
| // findIgnoredNode finds the node that matches the path in the tree. | |
| // If the node is found and ignored, it returns the node and true. | |
| func findIgnoredNode(tree noder.Noder, path noder.Path) (noder.Path, bool) { |
Maybe this is more intuitive?
| if excludeIgnoredChanges && m != nil { | ||
| return w.excludeIgnoredChanges(m, c), nil | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need to traverse through the changes?
I think it was already done in MatchNoder and DiffTree.
| if node, ok := ignoredNode(toTree, from); ok { | ||
| if err = diffNodesSameName(&ret, ii, ii.from.current, node); err != nil { | ||
| return nil, err | ||
| } | ||
| } else { | ||
| if err = ret.AddRecursiveDelete(from); err != nil { | ||
| return nil, err | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should also be added to onlyToRemains case.
We should support ignored tree being passed as toTree, for instance resetWorktree() uses reversed diff of index and worktree.
| if ok := isIgnoredNode(to[0], from); !ok { | ||
| if err = changes.AddRecursiveDelete(from); err != nil { | ||
| return err | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as the comment above. It should also be added to the opposite.
pjbgf
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I Benchmarked main vs this PR for the Linux Kernel repo, which is quite large. The changes makes it 2.5x slower while also churning 3 times more memory:
BenchmarkStatus/osfs.BoundOS-16 (main) 1 8356676003 ns/op 2158831416 B/op 15644121 allocs/op
BenchmarkStatus/osfs.BoundOS-16 (this PR) 1 20240289471 ns/op 6315067472 B/op 16515943 allocs/op
I appreciate that this may help some specific use cases, so if we were to go ahead this would need to be an opt-in, using the options that @onee-only suggested.
Overall, I think that Status could be optimised by leaning on the caching mechanisms that upstream has such as the index extensions TREE, LINK and UNTR. The support to those are being added in #1622. Once merged, they will need to be put to use in the Status logic.
| _, err = fs.Create("nested/ignore_dir/file") | ||
| s.NoError(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not closing files can lead to failing tests in Windows, so ideally we would consistently close them:
| _, err = fs.Create("nested/ignore_dir/file") | |
| s.NoError(err) | |
| f, err = fs.Create("nested/ignore_dir/file") | |
| s.NoError(err) | |
| err = f.Close() | |
| s.NoError(err) |
@pjbgf Note that this branch is currently 98 commits behind I've merged tested with the snippet above: |
|
@onee-only rebasing this PR and testing it against the Linux Kernel is still 40% slower, but to your point, it is not as bad as using the current state of the PR. The benchmark is using osfs rather than memfs. We probably should add a |

This PR contains two changes that significantly improve the performance of
WorkTree.Status()(from 491 s to 5 s):This signigifantly improves the performance of
Status():In a repository with 3M ignored files
Statusnow takes 5 s instead of 160 s..gitignorefiles in nested ignored directories.These were not matched in the current implementation, as that only checks the ignores of the current directory.
As a side-effect this change also leads to significantly improved performance: in a repository with 3M ignored files
Statusnow takes 160 s instead of 491 s.I have a backport to
masterready if desired.