What Is Aegis AI
Aegis AI is the world's first AI security agent. It runs on your Mac, Windows PC, or Linux machine and watches all of your cameras.
Not "motion detected." Your cameras finally understand what's happening.
Built on Vision Language Models, Large Language Models, and Voice Language Models — Aegis describes scenes, recognizes patterns, and learns your environment. No cloud required. No subscriptions. Your data stays home.
How It Works
Aegis connects multiple AI models into a unified intelligence pipeline that transforms raw camera footage into actionable understanding:
| Layer | What It Does | Examples |
|---|---|---|
| Vision Model (VLM) | Watches camera frames, describes the scene in natural language | SmolVLM2, Qwen 2.5 VL, Gemma 3, MiniCPM-V, LLaVA |
| Language Model (LLM) | Reasons about descriptions, answers questions, makes decisions | Built-in local models, OpenAI, Anthropic, Google |
| Agent | Maintains memory, sends alerts, executes tools, adapts to your routines | Configurable personality with soul, voice, memory, and toolbox |
| Skills | Modular ML capabilities — detection, segmentation, depth estimation | YOLO Detection, SAM2 Segmentation, Depth Anything v2 |
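The table above maps to a fairly simple composition. Here is a rough sketch in Python, using hypothetical interface names (VisionModel, LanguageModel, Skill, Agent) that illustrate the layering rather than the actual Aegis internals:

```python
# Illustrative sketch (not the Aegis source): how the four layers could be
# wired together as simple Python interfaces.
from dataclasses import dataclass, field
from typing import Protocol


class VisionModel(Protocol):
    def describe(self, frame: bytes) -> str:
        """Return a natural-language description of one camera frame."""


class LanguageModel(Protocol):
    def evaluate(self, description: str, rule: str) -> bool:
        """Decide whether a scene description satisfies a plain-English rule."""


class Skill(Protocol):
    def run(self, frame: bytes) -> dict:
        """Run a modular ML capability (detection, segmentation, depth, ...)."""


@dataclass
class Agent:
    vlm: VisionModel
    llm: LanguageModel
    skills: list[Skill] = field(default_factory=list)
    memory: list[str] = field(default_factory=list)

    def observe(self, frame: bytes, rules: list[str]) -> list[str]:
        """Describe a frame, check it against event rules, remember what matched."""
        description = self.vlm.describe(frame)
        matched = [r for r in rules if self.llm.evaluate(description, r)]
        if matched:
            self.memory.append(description)
        return matched
```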
The Pipeline in Action
Here's what happens every time a camera records a clip:
1. Frame extraction — Aegis pulls frames from the clip at your configured rate (0.1–5 fps)
2. VLM analysis — The active vision model sees each frame and writes a natural-language description: "Person approaching front door carrying a large brown box. Two vehicles parked in the driveway."
3. Event evaluation — The LLM compares the description against your configured event handlers using semantic understanding — not simple keyword matching
4. Alert delivery — If a match is found, Aegis sends a notification through your configured messaging channels (Telegram, Discord, or Slack) with the AI description and a snapshot
5. Memory update — The agent stores relevant observations in its memory, building long-term context about your environment
6. Timeline indexing — The clip, its AI description, and metadata are persisted to the timeline for future search and review
This entire pipeline runs continuously and automatically. You can also intervene at any point — ask the agent questions in natural language, search the timeline, or adjust event handlers — and the agent responds with full awareness of everything it has observed.
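As a minimal sketch of that per-clip flow, assuming OpenCV for frame sampling; the vlm, llm, notify, memory, and timeline objects are hypothetical stand-ins for the real Aegis components, not its actual API:

```python
# Hypothetical end-to-end sketch of the per-clip pipeline described above.
import cv2


def extract_frames(clip_path: str, target_fps: float = 1.0):
    """Yield frames sampled from the clip at roughly target_fps (e.g. 0.1-5 fps)."""
    video = cv2.VideoCapture(clip_path)
    native_fps = video.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, round(native_fps / target_fps))
    index = 0
    while True:
        ok, frame = video.read()
        if not ok:
            break
        if index % step == 0:
            yield frame
        index += 1
    video.release()


def process_clip(clip_path, vlm, llm, notify, memory, timeline, rules):
    """Run one clip through: describe -> evaluate -> alert -> remember -> index."""
    for frame in extract_frames(clip_path, target_fps=1.0):
        description = vlm.describe(frame)                              # VLM analysis
        matched = [r for r in rules if llm.evaluate(description, r)]   # semantic match
        for rule in matched:
            notify(rule, description, frame)                           # alert delivery
        memory.store(description)                                      # memory update
        timeline.index(clip_path, description)                         # timeline indexing
```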
What Makes This Different From Traditional Security Cameras
Traditional security cameras record video and detect motion. That's it. You get a wall of clips with no context, and you're responsible for watching all of them.
Aegis AI is fundamentally different:
| Traditional Camera | Aegis AI |
|---|---|
| "Motion detected at 3:42 PM" | "Person in a blue jacket walking up the driveway carrying a large brown box — likely a package delivery" |
| No understanding of context | Remembers your family members, daily routines, and known vehicles |
| Alert fatigue from constant false positives | Natural-language event rules that only fire when something genuinely matches |
| You have to watch every clip manually | Ask "what happened last night?" and get a summary |
| Footage stored, never analyzed | Every clip analyzed, described, indexed, and searchable |
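To make the semantic-matching difference concrete, here is an illustrative sketch of how a plain-English event rule could be evaluated by an LLM; the ask_llm callable and the prompt wording are assumptions for illustration, not the actual Aegis implementation:

```python
# Illustrative only: evaluating a plain-English event rule against a scene
# description semantically rather than by keyword matching. `ask_llm` is a
# hypothetical stand-in for whichever local or cloud model is configured.
def rule_matches(description: str, rule: str, ask_llm) -> bool:
    prompt = (
        f"Scene description: {description}\n"
        f"Event rule: {rule}\n"
        "Does the scene genuinely satisfy the rule? Answer YES or NO."
    )
    return ask_llm(prompt).strip().upper().startswith("YES")


# Example: the rule never mentions the word "box", yet a semantic match still fires.
# rule_matches(
#     "Person in a blue jacket walking up the driveway carrying a large brown box",
#     "Alert me when a package is being delivered",
#     ask_llm,
# )
```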
Cloud or Local — You Choose
Local-First Privacy
Browse and download VLMs directly from HuggingFace. Everything runs on your hardware — fully offline, zero API costs. Your camera footage, AI analysis, agent memory, and chat history never leave your machine.
The built-in AI Engine powers local inference with hardware-specific optimization:
- Apple Silicon — Metal GPU acceleration for near-real-time speeds
- NVIDIA GPUs — CUDA acceleration for fast, efficient processing
- CPU fallback — optimized inference for systems without a dedicated GPU
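As a rough illustration of that fallback order expressed in PyTorch terms (not how the built-in AI Engine is actually implemented):

```python
# Rough illustration of the hardware-selection idea described above:
# Metal (MPS) -> CUDA -> CPU fallback.
import torch


def pick_device() -> torch.device:
    if torch.backends.mps.is_available():   # Apple Silicon: Metal acceleration
        return torch.device("mps")
    if torch.cuda.is_available():           # NVIDIA GPUs: CUDA acceleration
        return torch.device("cuda")
    return torch.device("cpu")              # CPU fallback
```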
Cloud Providers
Bring your own OpenAI, Google, or Anthropic API key for maximum speed and quality. Cloud providers are optional — you pay the provider directly, and Aegis includes real-time cost estimation so you always know what you're spending.
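Cost estimation itself is simple arithmetic over token counts. A hedged sketch, with placeholder prices rather than any provider's real rates:

```python
# Sketch of per-request cost estimation for a bring-your-own-key provider.
# The prices below are assumed example rates (USD per million tokens), not
# Aegis's built-in figures; real providers publish prices that change over time.
PRICE_PER_MILLION_TOKENS = {"input": 0.50, "output": 1.50}


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one API call from its token counts."""
    return (
        input_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS["input"]
        + output_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS["output"]
    )


# e.g. a clip description of ~1,200 input tokens and ~150 output tokens comes to
# roughly estimate_cost(1200, 150) ~= $0.0008 at these example rates.
```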
Key Features at a Glance
| Feature | Description |
|---|---|
| Multi-camera monitoring | Blink, Ring, RTSP, ONVIF, webcam, and mobile — all in one grid |
| AI-powered video analysis | Every clip analyzed by a VLM, producing natural-language descriptions |
| Natural-language alerts | Define event handlers in plain English — the LLM evaluates matches semantically |
| Agent with memory | Persistent memory that learns your routines, family members, and environment |
| Multi-channel messaging | Receive alerts and hold conversations via Telegram, Discord, or Slack |
| Voice interaction | Text-to-speech with multiple local AI models, plus push-to-talk input |
| Skills marketplace | Install modular AI capabilities — object detection, depth estimation, segmentation |
| Model training | Fine-tune custom YOLO models on your own camera data |
| AI video generation | Create videos and images on demand using Google Veo, Gemini, or OpenAI |
| Storage management | Configurable retention policies, storage modes, and custom media paths |
| Cross-platform | macOS (Apple Silicon + Intel), Windows (x64), Linux (AppImage + .deb) |
| Fully offline | Download local models and run without internet — no subscriptions, no cloud accounts |