What Is Aegis AI

Aegis AI is the world's first AI security agent. It runs on your Mac, Windows PC, or Linux machine and makes sense of what all your cameras see.

Not "motion detected." Your cameras finally understand what's happening.

Built on Vision Language Models, Large Language Models, and Voice Language Models — Aegis describes scenes, recognizes patterns, and learns your environment. No cloud required. No subscriptions. Your data stays home.

How It Works

Aegis connects multiple AI models into a unified intelligence pipeline that transforms raw camera footage into actionable understanding:

| Layer | What It Does | Examples |
| --- | --- | --- |
| Vision Model (VLM) | Watches camera frames, describes the scene in natural language | SmolVLM2, Qwen 2.5 VL, Gemma 3, MiniCPM-V, LLaVA |
| Language Model (LLM) | Reasons about descriptions, answers questions, makes decisions | Built-in local models, OpenAI, Anthropic, Google |
| Agent | Maintains memory, sends alerts, executes tools, adapts to your routines | Configurable personality with soul, voice, memory, and toolbox |
| Skills | Modular ML capabilities — detection, segmentation, depth estimation | YOLO Detection, SAM2 Segmentation, Depth Anything v2 |

The Pipeline in Action

Here's what happens every time a camera records a clip:

  1. Frame extraction — Aegis pulls frames from the clip at your configured rate (0.1–5 fps)
  2. VLM analysis — The active vision model sees each frame and writes a natural-language description: "Person approaching front door carrying a large brown box. Two vehicles parked in the driveway."
  3. Event evaluation — The LLM compares the description against your configured event handlers using semantic understanding — not simple keyword matching
  4. Alert delivery — If a match is found, Aegis sends a notification through your configured messaging channels (Telegram, Discord, or Slack) with the AI description and a snapshot
  5. Memory update — The agent stores relevant observations in its memory, building long-term context about your environment
  6. Timeline indexing — The clip, its AI description, and metadata are persisted to the timeline for future search and review

This entire pipeline runs continuously and automatically. You can also intervene at any point — ask the agent questions in natural language, search the timeline, or adjust event handlers — and the agent responds with full awareness of everything it has observed.
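The six steps above can be sketched end-to-end in a few lines. This is illustrative only: `extract_frames`, `process_clip`, and the lambda stubs are hypothetical stand-ins for the real models and Aegis internals, which are not part of this document.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    memory: list[str] = field(default_factory=list)
    timeline: list[dict] = field(default_factory=list)

def extract_frames(clip_len_s: float, fps: float) -> list[float]:
    """Step 1: timestamps to sample at the configured rate (0.1-5 fps)."""
    return [round(i / fps, 2) for i in range(int(clip_len_s * fps))]

def process_clip(agent, clip_len_s, fps, describe, rules, matches, notify):
    for t in extract_frames(clip_len_s, fps):
        desc = describe(t)                     # Step 2: VLM scene description
        for rule in rules:
            if matches(rule, desc):            # Step 3: LLM semantic check
                notify(rule, desc)             # Step 4: alert delivery
        agent.memory.append(desc)              # Step 5: memory update
        agent.timeline.append({"t": t, "desc": desc})  # Step 6: timeline index

# Canned stubs stand in for the real models:
alerts = []
agent = Agent()
process_clip(agent, clip_len_s=10, fps=0.5,
             describe=lambda t: f"person at the front door (t={t}s)",
             rules=["someone at the front door"],
             matches=lambda rule, desc: "front door" in desc,
             notify=lambda rule, desc: alerts.append(desc))
```

A 10-second clip at 0.5 fps yields five sampled frames, each described, evaluated against every rule, remembered, and indexed.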

What Makes This Different From Traditional Security Cameras

Traditional security cameras record video and detect motion. That's it. You get a wall of clips with no context, and you're responsible for watching all of them.

Aegis AI is fundamentally different:

| Traditional Camera | Aegis AI |
| --- | --- |
| "Motion detected at 3:42 PM" | "Person in a blue jacket walking up the driveway carrying a large brown box — likely a package delivery" |
| No understanding of context | Remembers your family members, daily routines, and known vehicles |
| Alert fatigue from constant false positives | Natural-language event rules that only fire when something genuinely matches |
| You have to watch every clip manually | Ask "what happened last night?" and get a summary |
| Footage stored, never analyzed | Every clip analyzed, described, indexed, and searchable |
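What separates semantic rules from keyword triggers is the question the LLM is asked. A minimal sketch, with a hypothetical prompt shape (the real handler format is configured in the app, not shown here):

```python
def build_rule_prompt(rule: str, scene: str) -> str:
    """Hypothetical prompt; asks the LLM to judge meaning, not keywords."""
    return ("Does the scene match the rule? Answer YES or NO, "
            "judging meaning rather than keyword overlap.\n"
            f"Rule: {rule}\n"
            f"Scene: {scene}")

prompt = build_rule_prompt(
    "someone delivering a package",
    "Person in a blue jacket carrying a large brown box toward the door",
)
```

The scene description never contains the word "package," yet an LLM given this prompt can still match the rule — that is the semantic-versus-keyword distinction in the table above.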

Cloud or Local — You Choose

Local-First Privacy

Browse and download VLMs directly from HuggingFace. Everything runs on your hardware — fully offline, zero API costs. Your camera footage, AI analysis, agent memory, and chat history never leave your machine.

The built-in AI Engine powers local inference with hardware-specific optimization:

  • Apple Silicon — Metal GPU acceleration for near-realtime speeds
  • NVIDIA GPUs — CUDA acceleration for fast, efficient processing
  • CPU fallback — optimized inference for systems without a dedicated GPU
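The list above implies a priority order: Metal first, then CUDA, then CPU. That fallback can be expressed as a simple selector; the boolean flags here are illustrative inputs, not the engine's actual hardware-detection code:

```python
def pick_backend(has_metal: bool, has_cuda: bool) -> str:
    """Priority fallback: Metal on Apple Silicon, CUDA on NVIDIA, else CPU."""
    if has_metal:
        return "metal"
    if has_cuda:
        return "cuda"
    return "cpu"

print(pick_backend(has_metal=False, has_cuda=True))  # cuda
```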

Cloud Providers

Bring your own OpenAI, Google, or Anthropic API key for maximum speed and quality. Cloud providers are optional — you pay the provider directly, and Aegis includes real-time cost estimation so you always know what you're spending.
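Real-time cost estimation for cloud calls is, at its core, token counting multiplied by the provider's rates. A sketch of that arithmetic — the per-million-token prices are placeholders you would take from your provider's pricing page, not values shipped with Aegis:

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Token-based estimate; rates are user-supplied placeholders."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000

# e.g. 12,000 prompt tokens and 800 completion tokens at $1.00 / $4.00 per M:
cost = estimate_cost_usd(12_000, 800, 1.00, 4.00)
print(f"${cost:.4f}")  # $0.0152
```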

Key Features at a Glance

| Feature | Description |
| --- | --- |
| Multi-camera monitoring | Blink, Ring, RTSP, ONVIF, webcam, and mobile — all in one grid |
| AI-powered video analysis | Every clip analyzed by a VLM, producing natural-language descriptions |
| Natural-language alerts | Define event handlers in plain English — the LLM evaluates matches semantically |
| Agent with memory | Persistent memory that learns your routines, family members, and environment |
| Multi-channel messaging | Receive alerts and hold conversations via Telegram, Discord, or Slack |
| Voice interaction | Text-to-speech with multiple local AI models, plus push-to-talk input |
| Skills marketplace | Install modular AI capabilities — object detection, depth estimation, segmentation |
| Model training | Fine-tune custom YOLO models on your own camera data |
| AI video generation | Create videos and images on demand using Google Veo, Gemini, or OpenAI |
| Storage management | Configurable retention policies, storage modes, and custom media paths |
| Cross-platform | macOS (Apple Silicon + Intel), Windows (x64), Linux (AppImage + .deb) |
| Fully offline | Download local models and run without internet — no subscriptions, no cloud accounts |
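A retention policy like the one listed under storage management amounts to a periodic sweep that drops clips older than the configured window. A minimal sketch, assuming a hypothetical clip-name-to-timestamp mapping (Aegis's actual storage layout is not documented here):

```python
from datetime import datetime, timedelta

def clips_to_prune(clips: dict[str, datetime], retention_days: int,
                   now: datetime) -> list[str]:
    """Return names of clips recorded before the retention cutoff."""
    cutoff = now - timedelta(days=retention_days)
    return [name for name, recorded in clips.items() if recorded < cutoff]

now = datetime(2025, 6, 15)
clips = {"front_door_0601.mp4": datetime(2025, 6, 1),
         "driveway_0614.mp4": datetime(2025, 6, 14)}
print(clips_to_prune(clips, retention_days=7, now=now))  # ['front_door_0601.mp4']
```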

Up and Running in Minutes

Download from sharpai.org
Getting Started guide →