Improve Status performance with many ignored files #1379

silkeh · 2025-01-14T10:13:10Z

This PR contains two changes that significantly improve the performance of WorkTree.Status() (from 491 s to 5 s):

Skip ignored files when walking through the worktree.
This signigifantly improves the performance of Status():
In a repository with 3M ignored files Status now takes 5 s instead of 160 s.
Fix loading of .gitignore files in nested ignored directories.
These were not matched in the current implementation, as that only checks the ignores of the current directory.
As a side-effect this change also leads to significantly improved performance: in a repository with 3M ignored files Status now takes 160 s instead of 491 s.

I have a backport to master ready if desired.

pjbgf · 2025-01-15T19:39:24Z

@silkeh thanks for proposing this PR. Although this may improve the use case of multiple ignored files, it seems to cause a regression when that is not the case. I did a quick benchmark testing a call to Status() on a copy of the go-git repo, and these changes seem to have a negative impact on both allocation numbers and time per operation:

BenchmarkStatus-16 (v5.13.0)      	      48	  21514427 ns/op	19511754 B/op	   69999 allocs/op
BenchmarkStatus-16 (this PR)      	      34	  34658273 ns/op	42274701 B/op	  106194 allocs/op

I am struggling for time to pinpoint optimisation areas in the PR, would you be able to optimise it for the standard use case as well?

silkeh · 2025-01-15T19:54:42Z

@pjbgf: Thanks for taking a look! I'll take a look later this week and see what I can improve.

onee-only · 2025-06-12T07:55:18Z

Hello @silkeh, Are you still working on this? This could be a big improvement for #181.

silkeh · 2025-06-13T09:15:29Z

@onee-only: That was the plan, but I've been busy with other things. I'll try to make some time next week.

Fix loading of `.gitignore` files in nested ignored directories. These were not matched in the current implementation, as that only checks the ignores of the current directory. As a side-effect this change also leads to significantly improved performance: in a repository with 3M ignored files `Status` now takes 160 s instead of 491 s.

Skip ignored files when walking through the worktree. This signigifantly improves the performance of `Status()`: In a repository with 3M ignored files `Status` now takes 5 s instead of 160 s.

silkeh · 2025-09-12T16:04:22Z

I finally had some time to take a look and rebase the PR. The difference compared to origin/main (3a68d04) seems minor now. I got the following results:

BenchmarkStatus-12(v5.16.2)        433	  27579139 ns/op	18265183 B/op	   43611 allocs/op
BenchmarkStatus-12(origin/main)    324	  37745332 ns/op	 2106671 B/op	   41265 allocs/op
BenchmarkStatus-12(first commit)   321	  37490431 ns/op	 2107606 B/op	   41274 allocs/op
BenchmarkStatus-12(second commit)  307	  39710261 ns/op	 2174177 B/op	   42200 allocs/op

v5.16.2 is a lot faster, but this PR in its current form doesn't hurt performance too much. If you test with a repo on disk it actually seems to be a bit faster.

Benchmark I used:

func BenchmarkStatus(b *testing.B) {
	r, err := Clone(memory.NewStorage(), memfs.New(), &CloneOptions{
		URL: "https://github.com/go-git/go-git",
	})
	require.NoError(b, err)

	workTree, err := r.Worktree()
	require.NoError(b, err)

	_, err = workTree.Status()
	require.NoError(b, err)

	b.ReportAllocs()
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		workTree.Status()
	}
}

If I read the profiling correctly most of the performance degradation is introduced in MatchNoder.NumChildren because it now has to run through the patterns instead of just returning the info from disk:

I did a tiny optimization to reduce the allocations a bit, but the rest of the performance improvements would probably have to be done in the gitignore.matcher.

onee-only

Hello @silkeh! I'm looking forward to this being merged. I've left my opinions on the change. I hope these help to improve the code.

onee-only · 2025-10-26T07:49:21Z

plumbing/format/gitignore/dir.go

 // ReadPatterns reads the .git/info/exclude and then the gitignore patterns
 // recursively traversing through the directory structure. The result is in
 // the ascending order of priority (last higher).
 func ReadPatterns(fs billy.Filesystem, path []string) (ps []Pattern, err error) {


Suggested change

func ReadPatterns(fs billy.Filesystem, path []string) (ps []Pattern, err error) {

func ReadPatterns(fs billy.Filesystem, path string) (ps []Pattern, err error) {

Since exported function is now separated, can we make path to single string? Without reading it's internal behavior, I first thought it represents specific directories to read ignore patterns, not path parts to be joined.

onee-only · 2025-10-26T07:49:38Z

plumbing/format/gitignore/noder.go

+	noder.Noder
+
+	matcher  Matcher
+	invert   bool


Could you please elaborate what invert does? It seems that the value of invert is always true.

onee-only · 2025-10-26T07:50:02Z

plumbing/format/gitignore/noder.go

+type MatchNoder struct {
+	noder.Noder
+
+	matcher  Matcher
+	invert   bool
+	path     []string
+	children []noder.Noder
+}


9727b9b Introduced a Option struct in utils/merkletrie/filesystem. What do you think of adding gitignore.Matcher as an option in there? I believe we can achieve higher cohension with this.

Doing that and making IgnoreNoder interface could be a good idea for decoupling the implementation from the DiffTree().

onee-only · 2025-10-26T07:50:50Z

plumbing/format/gitignore/noder.go

+func (n *MatchNoder) findChild(node noder.Noder, name string) (noder.Noder, bool) {
+	children, _ := node.Children()
+
+	for _, child := range children {
+		if child.Name() == name {
+			return child, true
+		}
+	}
+
+	return nil, false
+}
+
+func (n *MatchNoder) noderPaths(path noder.Path) []string {
+	parts := make([]string, len(path))
+
+	for i, p := range path {
+		parts[i] = p.Name()
+	}
+
+	return parts
+}


Suggested change

func (n *MatchNoder) findChild(node noder.Noder, name string) (noder.Noder, bool) {

children, _ := node.Children()

for _, child := range children {

if child.Name() == name {

return child, true

}

}

return nil, false

}

func (n *MatchNoder) noderPaths(path noder.Path) []string {

parts := make([]string, len(path))

for i, p := range path {

parts[i] = p.Name()

}

return parts

}

func findChild(node noder.Noder, name string) (noder.Noder, bool) {

children, _ := node.Children()

for _, child := range children {

if child.Name() == name {

return child, true

}

}

return nil, false

}

func convertToStrings(path noder.Path) []string {

parts := make([]string, len(path))

for i, p := range path {

parts[i] = p.Name()

}

return parts

}

Since these functions aren't restricted to the use case of MatchNoder, we can move the functions out of the struct.

onee-only · 2025-10-26T07:56:27Z

utils/merkletrie/difftree.go

+
+func isIgnoredNode(tree noder.Noder, path noder.Path) bool {


Suggested change

func isIgnoredNode(tree noder.Noder, path noder.Path) bool {

// isIgnoredPath reports if the path is ignored in the tree.

func isIgnoredPath(tree noder.Noder, path noder.Path) bool {

in, ok := tree.(*gitignore.MatchNoder)

return ok && in.PathIgnored(path)

}

Maybe this is more intuitive?

onee-only · 2025-10-26T08:02:04Z

utils/merkletrie/difftree.go

+
+func ignoredNode(tree noder.Noder, path noder.Path) (noder.Path, bool) {


Suggested change

func ignoredNode(tree noder.Noder, path noder.Path) (noder.Path, bool) {

// findIgnoredNode finds the node that matches the path in the tree.

// If the node is found and ignored, it returns the node and true.

func findIgnoredNode(tree noder.Noder, path noder.Path) (noder.Path, bool) {

Maybe this is more intuitive?

onee-only · 2025-10-26T08:02:33Z

worktree_status.go

+	if excludeIgnoredChanges && m != nil {
+		return w.excludeIgnoredChanges(m, c), nil
 	}


Do we still need to traverse through the changes?
I think it was already done in MatchNoder and DiffTree.

onee-only · 2025-10-26T08:14:13Z

utils/merkletrie/difftree.go

+				if node, ok := ignoredNode(toTree, from); ok {
+					if err = diffNodesSameName(&ret, ii, ii.from.current, node); err != nil {
+						return nil, err
+					}
+				} else {
+					if err = ret.AddRecursiveDelete(from); err != nil {
+						return nil, err
+					}


I think this should also be added to onlyToRemains case.
We should support ignored tree being passed as toTree, for instance resetWorktree() uses reversed diff of index and worktree.

onee-only · 2025-10-26T08:15:38Z

utils/merkletrie/difftree.go

+		if ok := isIgnoredNode(to[0], from); !ok {
+			if err = changes.AddRecursiveDelete(from); err != nil {
+				return err
+			}


Same as the comment above. It should also be added to the opposite.

pjbgf

I Benchmarked main vs this PR for the Linux Kernel repo, which is quite large. The changes makes it 2.5x slower while also churning 3 times more memory:

BenchmarkStatus/osfs.BoundOS-16 (main)   	       1	 8356676003 ns/op 	2158831416 B/op	15644121 allocs/op
BenchmarkStatus/osfs.BoundOS-16 (this PR)	       1	20240289471 ns/op	6315067472 B/op	16515943 allocs/op

I appreciate that this may help some specific use cases, so if we were to go ahead this would need to be an opt-in, using the options that @onee-only suggested.

Overall, I think that Status could be optimised by leaning on the caching mechanisms that upstream has such as the index extensions TREE, LINK and UNTR. The support to those are being added in #1622. Once merged, they will need to be put to use in the Status logic.

pjbgf · 2025-10-28T07:07:30Z

plumbing/format/gitignore/dir_test.go

+	_, err = fs.Create("nested/ignore_dir/file")
+	s.NoError(err)


Not closing files can lead to failing tests in Windows, so ideally we would consistently close them:

Suggested change

_, err = fs.Create("nested/ignore_dir/file")

s.NoError(err)

f, err = fs.Create("nested/ignore_dir/file")

s.NoError(err)

err = f.Close()

s.NoError(err)

onee-only · 2025-10-28T08:55:01Z

I Benchmarked main vs this PR for the Linux Kernel repo, which is quite large. The changes makes it 2.5x slower while also churning 3 times more memory:

@pjbgf Note that this branch is currently 98 commits behind main. There were several PRs that improved performance as far as I remember.

I've merged main to this branch locally and the difference doesn't seem that large.

tested with the snippet above:

goos: darwin
goarch: arm64
pkg: github.com/go-git/go-git/v6
cpu: Apple M2
         │ before.txt  │             after.txt             │
         │   sec/op    │   sec/op     vs base              │
Status-8   11.16m ± 0%   11.83m ± 1%  +6.04% (p=0.002 n=6)

         │  before.txt  │             after.txt              │
         │     B/op     │     B/op      vs base              │
Status-8   2.104Mi ± 0%   2.138Mi ± 0%  +1.64% (p=0.002 n=6)

         │ before.txt  │             after.txt             │
         │  allocs/op  │  allocs/op   vs base              │
Status-8   46.35k ± 0%   46.80k ± 0%  +0.96% (p=0.002 n=6)

pjbgf · 2025-10-28T10:30:43Z

@onee-only rebasing this PR and testing it against the Linux Kernel is still 40% slower, but to your point, it is not as bad as using the current state of the PR.

goos: linux
goarch: amd64
pkg: github.com/go-git/go-git/v6
cpu: AMD Ryzen 7 PRO 8840HS w/ Radeon 780M Graphics
                     │ /tmp/main  │               /tmp/ig               │
                     │   sec/op   │   sec/op     vs base                │
Status/osfs.BoundOS-16   7.846 ± 4%   11.150 ± 5%  +42.12% (p=0.000 n=10)

                     │  /tmp/main   │               /tmp/ig               │
                     │     B/op     │     B/op      vs base               │
Status/osfs.BoundOS-16   805.1Mi ± 0%   809.6Mi ± 0%  +0.56% (p=0.001 n=10)

                     │  /tmp/main  │              /tmp/ig               │
                     │  allocs/op  │  allocs/op   vs base               │
Status/osfs.BoundOS-16   15.60M ± 0%   15.61M ± 0%  +0.05% (p=0.001 n=10)

The benchmark is using osfs rather than memfs. We probably should add a BenchmarkStatus to the code base so that we all use the same going forwards - ideally running across the built-in storages as they will be impacted in different manners.

pjbgf added this to the v6.0.0 milestone Jan 15, 2025

silkeh changed the base branch from v6-exp to main September 12, 2025 14:12

silkeh force-pushed the improve-ignored-status-performance branch from 25c3930 to 269fa79 Compare September 12, 2025 14:14

git: skip ignored files while walking worktree

dcfb50d

Skip ignored files when walking through the worktree. This signigifantly improves the performance of `Status()`: In a repository with 3M ignored files `Status` now takes 5 s instead of 160 s.

silkeh force-pushed the improve-ignored-status-performance branch from 269fa79 to dcfb50d Compare September 12, 2025 16:01

onee-only reviewed Oct 26, 2025

View reviewed changes

pjbgf requested changes Oct 28, 2025

View reviewed changes

	func ReadPatterns(fs billy.Filesystem, path []string) (ps []Pattern, err error) {
	func ReadPatterns(fs billy.Filesystem, path string) (ps []Pattern, err error) {


		func isIgnoredNode(tree noder.Noder, path noder.Path) bool {

-func isIgnoredNode(tree noder.Noder, path noder.Path) bool {
+// isIgnoredPath reports if the path is ignored in the tree.
+func isIgnoredPath(tree noder.Noder, path noder.Path) bool {
+	in, ok := tree.(*gitignore.MatchNoder)
+	return ok && in.PathIgnored(path)
+}


		func ignoredNode(tree noder.Noder, path noder.Path) (noder.Path, bool) {

-func ignoredNode(tree noder.Noder, path noder.Path) (noder.Path, bool) {
+// findIgnoredNode finds the node that matches the path in the tree.
+// If the node is found and ignored, it returns the node and true.
+func findIgnoredNode(tree noder.Noder, path noder.Path) (noder.Path, bool) {

Improve Status performance with many ignored files #1379

Are you sure you want to change the base?

Improve Status performance with many ignored files #1379

Uh oh!

Conversation

silkeh commented Jan 14, 2025

Uh oh!

pjbgf commented Jan 15, 2025

Uh oh!

silkeh commented Jan 15, 2025

Uh oh!

onee-only commented Jun 12, 2025

Uh oh!

silkeh commented Jun 13, 2025

Uh oh!

silkeh commented Sep 12, 2025

Uh oh!

onee-only left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pjbgf left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

onee-only commented Oct 28, 2025

Uh oh!

pjbgf commented Oct 28, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants