Architecture

High-level system diagrams and structural documentation for my core engineering projects. The designs focus on modularity, local execution, and scalable AI infrastructure.

OmniSLM Architecture

REST API (FastAPI) / WebSocket Gateway

↓

Core Orchestrator: Routing Engine, Agent Runtime, Tool Registry

↓

Memory Engine: Vector Store (FAISS) + SQLite Context

Inference Engine: Ollama / Local LLM Bridge

OmniSLM uses a layered architecture designed for extensibility. The Core Orchestrator sits between the API gateway and the underlying inference/memory engines. By abstracting the Vector Store and Inference Engine behind unified interfaces, developers can swap out FAISS for Qdrant, or Ollama for vLLM, without altering their agent logic.
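The swap-out property described above can be illustrated with a minimal sketch. All names here (VectorStore, InferenceEngine, InMemoryVectorStore, EchoEngine, Agent) are hypothetical and are not taken from the OmniSLM codebase; the toy backends stand in for FAISS/Qdrant and Ollama/vLLM.

```python
from dataclasses import dataclass, field
from typing import Protocol


class VectorStore(Protocol):
    """Hypothetical interface the orchestrator would depend on."""
    def add(self, doc_id: str, vector: list[float]) -> None: ...
    def search(self, query: list[float], k: int) -> list[str]: ...


class InferenceEngine(Protocol):
    """Hypothetical interface for any LLM backend."""
    def generate(self, prompt: str) -> str: ...


@dataclass
class InMemoryVectorStore:
    """Toy stand-in for a FAISS- or Qdrant-backed store."""
    items: dict[str, list[float]] = field(default_factory=dict)

    def add(self, doc_id: str, vector: list[float]) -> None:
        self.items[doc_id] = vector

    def search(self, query: list[float], k: int) -> list[str]:
        # Rank stored vectors by dot product with the query, highest first.
        def dot(a: list[float], b: list[float]) -> float:
            return sum(x * y for x, y in zip(a, b))
        ranked = sorted(self.items, key=lambda d: dot(self.items[d], query), reverse=True)
        return ranked[:k]


class EchoEngine:
    """Toy stand-in for an Ollama or vLLM bridge."""
    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"


class Agent:
    """Agent logic sees only the two interfaces, never a concrete backend."""
    def __init__(self, store: VectorStore, engine: InferenceEngine) -> None:
        self.store = store
        self.engine = engine

    def answer(self, question: str, query_vector: list[float]) -> str:
        context = self.store.search(query_vector, k=2)
        return self.engine.generate(f"context={context} question={question}")
```

Under this assumption, replacing a backend means writing another class with the same two methods and passing it to Agent; the agent code itself does not change.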

RAG Pipeline Architecture

Ingestion Flow

1. Document Loaders (PDF, TXT, MD)
2. Semantic Text Splitters (Overlap allowed)
3. Local Embedding (SentenceTransformers)
4. Vector Indexing (FAISS / Pinecone)
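Step 2 of the ingestion flow can be sketched as follows. This is a simplification: it uses a fixed-size word window with overlap rather than a semantic splitter, and the function name and default sizes are assumptions chosen for illustration.

```python
def split_with_overlap(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-window chunks; consecutive chunks share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of indexing some words twice.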

Retrieval Flow

1. Query Expansion & Rewriting
2. Hybrid Search (Dense Vector + Sparse Keyword)
3. Cross-Encoder Re-ranking
4. Context Injection & Prompt Construction

The RAG system relies on hybrid retrieval to improve accuracy. Dense embeddings alone often miss exact keyword matches (such as acronyms or IDs). The pipeline therefore combines dense vector search with sparse retrieval (e.g., BM25) and passes the merged results through a cross-encoder for re-ranking, so that the LLM receives the most relevant context the retrievers can surface.
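One way to merge the dense and sparse result lists before re-ranking is reciprocal rank fusion (RRF). The source does not say which fusion method the pipeline uses, so RRF is an assumption here, as are the function names; `rerank_score` is a hypothetical callable standing in for a cross-encoder.

```python
from typing import Callable


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs; each list contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_retrieve(
    dense: list[str],
    sparse: list[str],
    rerank_score: Callable[[str], float],
    top_k: int = 3,
) -> list[str]:
    """Fuse dense and sparse hits, then reorder the candidates with the re-ranker."""
    candidates = reciprocal_rank_fusion([dense, sparse])
    reranked = sorted(candidates, key=rerank_score, reverse=True)
    return reranked[:top_k]
```

RRF works on ranks rather than raw scores, which sidesteps the problem that cosine similarities and BM25 scores are not on a comparable scale.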

Agent Runtime Architecture

The agent runtime is built on the ReAct (Reasoning and Acting) paradigm, tailored for smaller context windows. Instead of overwhelming an 8B model with 50 tools, it uses a hierarchical routing architecture: a lightweight classifier model selects a specialized sub-agent, which is then provisioned with only the 3 to 5 tools necessary for its domain.
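The two-level routing can be sketched as below. The keyword-overlap `classify` function is a deliberately simple stand-in for the lightweight classifier model, and every name in the block (SubAgent, classify, build_tool_prompt, the example tools) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

Tool = Callable[[str], str]


@dataclass
class SubAgent:
    name: str
    keywords: tuple[str, ...]
    tools: dict[str, Tool]  # only the handful of tools this domain needs


def classify(query: str, agents: list[SubAgent], default: SubAgent) -> SubAgent:
    """Pick the sub-agent whose keywords overlap the query most (classifier stand-in)."""
    tokens = set(query.lower().split())
    best = max(agents, key=lambda a: len(tokens & set(a.keywords)))
    return best if tokens & set(best.keywords) else default


def build_tool_prompt(agent: SubAgent) -> str:
    """List only the selected sub-agent's tools, keeping the prompt small."""
    return f"You are the {agent.name} agent. Tools: {', '.join(sorted(agent.tools))}"


files_agent = SubAgent(
    name="files",
    keywords=("file", "read", "write"),
    tools={"read_file": lambda path: f"contents of {path}", "list_dir": lambda path: path},
)
math_agent = SubAgent(
    name="math",
    keywords=("sum", "calculate", "multiply"),
    tools={"calculator": lambda expr: expr},
)
general_agent = SubAgent(name="general", keywords=(), tools={})
```

With this layout the small model only ever sees the tool descriptions of one sub-agent, so prompt size stays roughly constant as the global tool registry grows.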

Spring AI Multi-Tenant Platform

Tenant Context Filter → Spring AI Client (Dynamic Routing) → Ollama (Local) / OpenAI (Fallback)

In the Java ecosystem, the Local LLM Platform uses a ThreadLocal-based tenant context (or Reactor Context for WebFlux) to attach a tenant ID to every AI request. This keeps vector searches scoped to a single tenant and allows per-tenant model configurations (e.g., Tenant A uses local Llama 3 for privacy, Tenant B uses GPT-4 for complex reasoning).
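The tenant-propagation idea can be shown in a language-neutral way. The sketch below is a Python analogue, not Spring code: `contextvars.ContextVar` plays the role that ThreadLocal or Reactor Context plays on the JVM, and the configuration table, tenant IDs, and function names are all assumptions made for illustration.

```python
from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import dataclass

# Holds the tenant for the current request; unset outside a request scope.
_current_tenant: ContextVar[str] = ContextVar("current_tenant")


@dataclass(frozen=True)
class TenantConfig:
    provider: str
    model: str


# Hypothetical per-tenant settings mirroring the example in the text.
TENANT_CONFIGS = {
    "tenant-a": TenantConfig(provider="ollama", model="llama3"),
    "tenant-b": TenantConfig(provider="openai", model="gpt-4"),
}


@contextmanager
def tenant_scope(tenant_id: str):
    """What a request filter would do: set the tenant, then always clear it."""
    token = _current_tenant.set(tenant_id)
    try:
        yield
    finally:
        _current_tenant.reset(token)


def build_request(prompt: str) -> dict:
    """Attach the tenant's model choice and a vector-search filter to the request."""
    tenant_id = _current_tenant.get()  # raises LookupError when no tenant is set
    config = TENANT_CONFIGS[tenant_id]
    return {
        "provider": config.provider,
        "model": config.model,
        "prompt": prompt,
        "vector_filter": {"tenant_id": tenant_id},
    }
```

Failing loudly when no tenant is set is a deliberate choice in this sketch: a request that silently falls back to a default tenant is how cross-tenant leaks tend to happen.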

Blockchain + AI Architecture

This architecture is used in SeedTracking. It separates deterministic consensus from probabilistic inference. Smart contracts on Ethereum govern state transitions (e.g., transferring seed ownership), while an off-chain Python microservice listens to contract events. When a transfer occurs, the Python service fetches the IPFS metadata, runs an ML fraud-detection model, and writes a risk score back to the blockchain via an oracle pattern.
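The off-chain half of that flow might look like the sketch below. It is not the SeedTracking implementation: the event fields, function names, and the basis-point encoding are assumptions, and the IPFS fetch, the model, and the on-chain write are injected as plain callables so that no specific blockchain client API is implied.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TransferEvent:
    """Hypothetical shape of a decoded ownership-transfer event."""
    seed_id: int
    new_owner: str
    metadata_cid: str  # IPFS content identifier for the seed's metadata


def handle_transfer(
    event: TransferEvent,
    fetch_metadata: Callable[[str], dict],
    score_risk: Callable[[dict], float],
    submit_score: Callable[[int, int], None],
) -> int:
    """Fetch metadata, score it, and hand an integer risk score to the oracle writer."""
    metadata = fetch_metadata(event.metadata_cid)
    risk = score_risk(metadata)
    if not 0.0 <= risk <= 1.0:
        raise ValueError(f"risk score out of range: {risk}")
    # Assumption: the contract stores the score as integer basis points,
    # because Solidity has no floating-point type.
    risk_bps = round(risk * 10_000)
    submit_score(event.seed_id, risk_bps)
    return risk_bps
```

Keeping the model call outside the contract means the chain only records the resulting score; the probabilistic step can be retrained or replaced without touching consensus logic.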