Skip to content

sumanpokhrel-11/symspell-c99

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

8 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

SymSpell C99

A blazingly fast spell-checking library - Pure C99 implementation of Wolf Garbe's SymSpell algorithm.

License: MIT Language POSIX

By Suman Pokhrel


Why SymSpell?

SymSpell is 1 million times faster than traditional spell-checkers. Instead of generating all possible corrections for a misspelling, it pre-computes deletions of dictionary words, enabling lightning-fast lookups.

Original algorithm by Wolf Garbe


Features

  • ⚑ Fast: Average lookup time ~5Β΅s (real-world usage)*
  • 🎯 Accurate: 82-84% correction rate on standard test sets
  • 🧹 Clean: Zero compiler warnings, ~700 lines of well-structured code
  • 🌐 Portable: POSIX-compliant, works on any Unix system
  • βœ… Complete: Includes tests, benchmarks, and dictionary tools
  • πŸ“¦ Self-contained: No external dependencies

Quick Start

make
./test_symspell dictionaries/dictionary.txt

Interactive mode:

Enter words to correct (or 'quit' to exit):
recieve
  Suggestions:
    receive (distance=1, iwf=7.234567, prob=0.001234, freq=234567)

Usage

#include "symspell.h"

// Create dictionary with max edit distance=2, prefix length=7
symspell_dict_t* dict = symspell_create(2, 7);

// Load dictionary (86k words optimized for spell-checking)
symspell_load_dictionary(dict, "dictionaries/dictionary.txt", 0, 1);

// Lookup suggestions
symspell_suggestion_t suggestions[5];
int count = symspell_lookup(dict, "speling", 2, suggestions, 5);

// Print results
for (int i = 0; i < count; i++) {
    printf("%s (distance=%d)\n", 
           suggestions[i].term, 
           suggestions[i].distance);
}

// Cleanup
symspell_destroy(dict);

Performance

Average lookup time:     ~5Β΅s (real-world usage)*
Misspelling correction:  ~30Β΅s (worst case)
Correctly spelled:       ~0.7Β΅s (fast path)
Correction accuracy:     82-84%
Dictionary size:         86,060 words
Memory usage:            ~45MB (with deletes index)

*Based on 15% real-world error rate in user-typed text (2-3% error per character Γ— 4.7 average characters per word)

Tested on Apple M4, comparable results on x86.


Building

Requirements:

  • C99 compiler (GCC, Clang)
  • Make
  • POSIX-compliant system (Linux, macOS, BSD)

Compile:

make              # Build test and benchmark programs
make test         # Build and run tests
make benchmark    # Build and run benchmark
make clean        # Remove built programs

Manual compilation:

gcc -std=c99 -O2 -Iinclude test/test_symspell.c src/symspell.c -o test_symspell

Understanding Results

Edit Distance Matters

SymSpell returns suggestions based on edit distance (number of character changes needed):

# "helo" with max_distance=1
$ ./test_symspell dictionaries/dictionary.txt
> helo
  held (distance=1)    # One substitution: o→d

# "helo" with max_distance=2
$ ./test_symspell dictionaries/dictionary.txt --max-distance 2
> helo
  hello (distance=2)   # Two insertions: +l, +o
  held (distance=1)    # Still shown as closer match

The algorithm prefers closer matches. If you want "hello" as a suggestion, increase max_edit_distance when calling symspell_lookup().

Why Not 100% Accuracy?

Real-world typos are ambiguous:

  • "helo" β†’ "held" or "hello"?
  • "teh" β†’ "the" or "tea"?
  • "there" β†’ correct word or "their"?

Even commercial spell-checkers (Microsoft Word, Google Docs) achieve 80-85% accuracy on standardized test sets because:

  1. Language is inherently ambiguous
  2. Context is needed for many corrections
  3. Trade-offs exist between precision and recall

Our 82-84% matches the original SymSpell paper and is competitive with commercial solutions.

Performance Context

"5Β΅s average" means:

  • βœ… 200x faster than Python implementations (~1000Β΅s)
  • βœ… 2000x faster than traditional approaches (~10000Β΅s)
  • βœ… Can check 1000 words in 5ms (imperceptible to users)
  • βœ… Real-time spell-checking with zero UI lag

For most applications, this is more than fast enough.


Dictionaries

Main dictionary (dictionaries/dictionary.txt): 86,060 words

  • Optimized for spell-checking accuracy (82-84%)
  • Balanced across multiple test sets
  • Recommended for general use

Extras dictionary (dictionaries/dictionary-extras.txt): 25,327 words

  • Technical terms and acronyms
  • For keyword extraction and IWF scoring
  • DO NOT use for spell-checking (reduces accuracy)

Building custom dictionaries: See dictionaries/reference/ for source materials and tools. Full documentation in docs/DICTIONARY.md.


Project Structure

symspell-c99/
β”œβ”€β”€ src/
β”‚   └── symspell.c          # Implementation (~700 lines)
β”œβ”€β”€ include/
β”‚   β”œβ”€β”€ symspell.h          # Public API
β”‚   β”œβ”€β”€ hash.h              # Hash table implementation
β”‚   β”œβ”€β”€ xxh3.h              # xxHash for fast hashing
β”‚   └── posix.h             # POSIX compatibility layer
β”œβ”€β”€ test/
β”‚   β”œβ”€β”€ test_symspell.c     # Interactive test program
β”‚   β”œβ”€β”€ benchmark_symspell.c # Performance benchmarks
β”‚   └── data/               # Test datasets
β”œβ”€β”€ dictionaries/
β”‚   β”œβ”€β”€ dictionary.txt      # Main 86k word dictionary
β”‚   └── reference/          # Dictionary building tools
└── docs/
    β”œβ”€β”€ SYMSPELL.md         # Algorithm details
    └── DICTIONARY.md       # Dictionary customization

Documentation


Algorithm

SymSpell uses Symmetric Delete spelling correction:

  1. Pre-computation: Generate all deletions up to edit distance N for each dictionary word
  2. Lookup: Generate deletions of the input word
  3. Match: Find dictionary words that share these deletions
  4. Rank: Sort by edit distance and word frequency

This trades memory for speed - pre-computing deletions makes lookups extremely fast.


Testing

Run the test program:

./test_symspell dictionaries/dictionary.txt

Run benchmarks:

./benchmark_symspell dictionaries/dictionary.txt test/data/symspell/misspellings/misspell-codespell.txt

FAQ

Q: Why does "helo" suggest "held" instead of "hello"?

A: "held" has edit distance 1 (one character change), while "hello" has edit distance 2 (two insertions). SymSpell prefers closer matches. Increase max_edit_distance to get more suggestions.

Q: Why only 82-84% accuracy?

A: This is competitive with commercial spell-checkers and matches the original SymSpell paper. Perfect accuracy is impossible due to language ambiguity.

Q: Compilation fails with undefined reference to log

A: Add -lm flag to link the math library:

gcc -std=c99 -O2 -Iinclude test/test_symspell.c src/symspell.c -lm -o test_symspell

License

MIT License - See LICENSE file for details.

This implementation is independent but follows the algorithm described in:


Credits

  • Algorithm: Wolf Garbe - Creator of SymSpell
  • Implementation: Suman Pokhrel - Pure C99 port

Special thanks to Dr. James Freeman for his mentorship and guidance on this project.


Author

Suman Pokhrel


Contributing

This is the first pure C99 implementation of SymSpell. Contributions welcome:

  • Bug reports and fixes
  • Performance improvements
  • Dictionary enhancements
  • Documentation improvements
  • Platform-specific optimizations

Please open an issue or pull request on GitHub!


See Also


This is the first and only pure C99 implementation of SymSpell, designed for performance, portability, and clean code.