GenAI Advanced
This page focuses on how GenAI models work internally: understanding embeddings and vectors, exploring advanced techniques like RAG and function calling, and building agentic systems that go beyond simple prompts.
Table of Contents
- Vectors and embeddings: How AI understands meaning
- From embeddings to responses: The inference process
- Neural networks and transformers
- Attention mechanism
- Context windows and model parameters
- Alignment: making models follow principles
- Fine-tuning a model
- Function calling
- Model Context Protocol (MCP)
- Retrieval Augmented Generation (RAG)
- Agents and agentic AI
- Multi-agent solutions
- Scaling AI implementations
- The AI-native web: NLWeb, llms.txt, and semantic search
Vectors and embeddings: How AI understands meaning
Everything an AI processes—words, images, concepts—gets converted into vectors, which are simply lists of numbers. Think of a vector as a precise coordinate in multi-dimensional space that captures the essence of what something means.
Embeddings are sophisticated vectors that capture semantic meaning. When an AI learns that “dog” and “puppy” are related, it places their embeddings close together in this mathematical space. Similarly, “king” minus “man” plus “woman” might land near “queen”—the model has learned relationships between concepts through the geometric arrangement of their embeddings.
This mathematical representation allows AI models to understand that “vehicle” relates to both “car” and “bicycle,” even if those specific connections weren’t explicitly taught. The model discovers these relationships by observing patterns in how words appear together across millions of examples.
graph LR
Word1["Word: 'dog'"]
Word2["Word: 'puppy'"]
Word3["Word: 'car'"]
Vec1["Vector: [0.2, 0.8, 0.3, ...]"]
Vec2["Vector: [0.19, 0.82, 0.28, ...]"]
Vec3["Vector: [0.7, 0.1, 0.9, ...]"]
Space["Multi-dimensional space where semantic relationships are captured by proximity"]
Word1 --> Vec1
Word2 --> Vec2
Word3 --> Vec3
Vec1 --> Space
Vec2 --> Space
Vec3 --> Space
style Word1 fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style Word2 fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style Word3 fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style Vec1 fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style Vec2 fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style Vec3 fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style Space fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
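To make "proximity" concrete, here is a minimal Python sketch that compares toy vectors with cosine similarity. The numbers mirror the diagram above but are illustrative only; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """1.0 means the vectors point the same way (similar meaning); values near 0 mean unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" matching the diagram above
dog = [0.2, 0.8, 0.3]
puppy = [0.19, 0.82, 0.28]
car = [0.7, 0.1, 0.9]

print(cosine_similarity(dog, puppy))  # high (close to 1.0): related concepts sit close together
print(cosine_similarity(dog, car))    # noticeably lower: unrelated concepts sit further apart
```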
From embeddings to responses: The inference process
Inference is what happens when you send a prompt to an AI model and receive a response. The model converts your words into embeddings, processes those mathematical representations through its neural network, and converts the results back into human-readable text.
During inference, the model doesn’t “think” the way humans do. Instead, it performs billions of mathematical calculations to predict the most likely next word, then the word after that, building responses token by token based on the patterns it learned during training.
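A heavily simplified sketch of that loop, assuming a `predict_next_token_probabilities` callable that stands in for the real neural network (which is what actually produces the distribution over the vocabulary):

```python
import random

def generate(prompt_tokens, predict_next_token_probabilities, max_new_tokens=50, end_token="<eos>"):
    """Build a response token by token: predict a distribution, sample one token, repeat."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # The model assigns a probability to every token in its vocabulary,
        # conditioned on everything generated so far.
        probabilities = predict_next_token_probabilities(tokens)  # dict: token -> probability
        next_token = random.choices(list(probabilities), weights=list(probabilities.values()))[0]
        if next_token == end_token:
            break
        tokens.append(next_token)
    return tokens
```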
Neural networks and transformers
Now that we understand how AI models represent and process information, we can explore the sophisticated mechanisms that make modern AI so powerful.
The foundation of learning
A neural network mimics how biological brains process information through interconnected nodes. Each connection has a weight—a number that determines how much influence one piece of information has on another. During training, the model adjusts billions of these weights to improve its predictions.
Transformers: A revolutionary architecture
Modern language models are built on the transformer architecture, which changed how AI understands language. Unlike earlier approaches that processed text sequentially (word by word), transformers can examine entire passages simultaneously and understand relationships between any words, regardless of how far apart they appear.
Attention mechanism
The breakthrough innovation in transformers is the attention mechanism. When generating each word, the model can “attend to” or focus on the most relevant parts of the input, just as you might reread key phrases when writing a response to a complex question.
For example, when translating “The cat that was sleeping on the mat was orange,” the attention mechanism helps the model understand that “orange” describes “cat,” not “mat,” even though other words appear between them.
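Under the hood this is scaled dot-product attention. A minimal NumPy sketch of a single attention head (toy sizes, random inputs):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query scores every key; the scores become weights via softmax;
    the output mixes the values according to those weights."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # relevance of every token to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax: each row sums to 1
    return weights @ V                                   # weighted blend of the value vectors

# Toy self-attention: 4 tokens, each an 8-dimensional vector, so Q, K, and V all come from the same matrix
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(x, x, x).shape)       # (4, 8): one updated vector per token
```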
Context windows and model parameters
Parameters and model capability
The parameters in a model (the adjustable weights we mentioned) directly impact capability. GPT-3 has 175 billion parameters, while some newer models have over a trillion. More parameters generally mean better understanding of nuanced language patterns, though they also require more computational resources.
Context windows
Context windows determine how much information a model can consider at once. Larger context windows allow models to maintain coherence across longer conversations and documents, but they also increase computational costs and processing time.
Training data and knowledge cutoff
The training data (billions of web pages, books, and articles) shapes what the model knows. The cut-off date represents the latest information in this training data, which is why models can’t discuss events that happened after their training completed.
Practical implications: Balancing trade-offs
Every advanced feature involves trade-offs. Larger context windows enable more sophisticated reasoning but increase latency and costs. Higher-parameter models provide better quality but require more computational resources. Understanding these trade-offs helps you choose the right model configuration for your specific needs.
When designing applications, consider how vocabulary size (the tokens a model understands), temperature settings (creativity vs. consistency), and seed values (reproducibility) align with your goals for latency, accuracy, cost, and reliability.
For a comprehensive deep dive into how these concepts work together, Andrej Karpathy’s tutorial on building ChatGPT from scratch provides an excellent technical foundation.
More information:
- Microsoft Releases Dion: A New Scalable Optimizer for Training AI Models
- Optimizing Large-Scale AI Performance with Pretraining Validation on a Single Azure ND GB200 v6
- Benchmarking Llama 3.1 8B AI Inference on Azure ND-H100-v5 with vLLM
Alignment: making models follow principles
As models get stronger, we also need them to behave safely and predictably. “Alignment” is the process of shaping model behavior to follow clear rules and values, not just statistics from the training data.
Constitutional AI
A practical approach you’ll see in modern systems is Constitutional AI (popularized by Anthropic): the model uses a short, written set of principles (a “constitution”) to critique and improve its own answers.
How this works in practice:
- Draft: the model produces an initial answer.
- Self-critique: it reviews that answer against the constitution (e.g., be helpful and honest, avoid facilitating harm, acknowledge uncertainty).
- Revise: it edits the answer to better follow the principles.
- Preference training (RLAIF): training then favors these revised answers using reinforcement learning from AI feedback, reducing dependence on large human-labeled datasets.
Why this helps
- Principles are explicit and auditable.
- Scales alignment with fewer human labels.
- Produces more consistent “helpful, harmless, honest” behavior.
Limits to keep in mind
- Only as good as the chosen principles (they can be incomplete or biased) and may lead to over-refusal in edge cases.
- Not a substitute for factual grounding—use retrieval (RAG) and tools for accuracy and citations.
Example principle: “Avoid providing instructions that meaningfully facilitate wrongdoing.” During self-critique, the model removes or reframes such content before replying.
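A minimal sketch of the draft/critique/revise loop described above. The `llm` argument is a placeholder for whatever chat-completion call you use, and the principles are illustrative, not Anthropic’s actual constitution:

```python
PRINCIPLES = [
    "Be helpful and honest.",
    "Avoid providing instructions that meaningfully facilitate wrongdoing.",
    "Acknowledge uncertainty instead of guessing.",
]

def constitutional_answer(llm, user_prompt: str) -> str:
    """Draft an answer, critique it against the principles, then revise it before replying."""
    draft = llm(user_prompt)

    critique = llm(
        "Review the answer below against these principles:\n"
        + "\n".join(f"- {p}" for p in PRINCIPLES)
        + f"\n\nAnswer:\n{draft}\n\nList any violations or weaknesses."
    )

    revised = llm(
        "Rewrite the answer so it follows every principle, keeping it as helpful as possible.\n\n"
        f"Original answer:\n{draft}\n\nCritique:\n{critique}"
    )
    return revised
```

In RLAIF, pairs of original and revised answers like these become the preference data that training then favors.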
Tip: keep these concepts practical. As you design a use case, tie terms like “context,” “embeddings,” and “attention” to concrete trade-offs: latency, accuracy, token cost, and guardrails.
Fine-tuning a model
Fine-tuning, strictly speaking, means continuing to train a model on your own examples so its weights adapt to your domain. Full fine-tuning requires significant data and compute, but you can often steer model behavior and output toward your specific needs with lighter-weight techniques:
Grounding
Grounding provides the AI with specific, factual information to base its responses on. Instead of relying on the model’s training data, you supply current, accurate information within your prompt. For example, when asking about company policies, include the actual policy text in your prompt rather than assuming the model knows current details.
Temperature
Temperature controls how creative or predictable the AI’s responses are:
- Low temperature (0.0-0.3): More focused and consistent responses, good for factual tasks
- Medium temperature (0.4-0.7): Balanced creativity and consistency, suitable for most general tasks
- High temperature (0.8-1.0): More creative and varied responses, useful for brainstorming or creative writing
Top P (nucleus sampling)
Top P determines how many alternative words the model considers when generating each token:
- Low Top P (0.1-0.5): More focused responses using only the most likely word choices
- High Top P (0.8-1.0): More diverse responses considering a wider range of possible words
These settings work together: you might use low temperature and low Top P for consistent, factual responses, or high temperature and high Top P for creative brainstorming sessions.
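As an illustration, here is how these settings are typically passed with the OpenAI Python client; the model name and prompts are placeholders, and most other chat-completion APIs expose equivalent parameters:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Low temperature + low Top P: focused, repeatable output for factual tasks
factual = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize our refund policy in two sentences."}],
    temperature=0.2,
    top_p=0.3,
)

# High temperature + high Top P: more varied output for brainstorming
creative = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Suggest ten names for an internal AI assistant."}],
    temperature=0.9,
    top_p=0.95,
)

print(factual.choices[0].message.content)
print(creative.choices[0].message.content)
```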
More information:
- Enhancing Conversational Agents with Azure AI Language: CLU and Custom Question Answering
- What’s New in Azure AI Foundry - July 2025
- OpenAI’s Open-Source Model: gpt-oss on Azure AI Foundry and Windows AI Foundry
Function calling
Function calling allows AI models to use external tools and services during their responses. Instead of only generating text, the model can call predefined functions to perform specific actions like checking the weather, calculating mathematical expressions, or retrieving current information from databases.
How it works:
- You define functions with clear descriptions of what they do and what parameters they need
- The AI model analyzes your prompt and determines if any functions would help answer your question
- The model calls the appropriate function with the right parameters
- The function returns results, which the model incorporates into its response
sequenceDiagram
participant User
participant AI as AI Model
participant Function
User->>AI: "How long does it take to fly from New York to Los Angeles?"
AI->>AI: Analyzes prompt
AI->>Function: get_flight_duration("JFK", "LAX", true)
Function->>AI: "6 hours 30 minutes including one layover"
AI->>User: "The flight from New York to Los Angeles takes about 6 hours and 30 minutes, including one layover."
style User fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style AI fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style Function fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
Example function definition:
Function: get_flight_duration
Description: Calculate flight duration between two airports
Parameters:
- departure_airport: IATA airport code (e.g., "JFK", "LAX")
- arrival_airport: IATA airport code (e.g., "JFK", "LAX")
- include_layovers: Boolean, whether to include connection time
Example usage:
User: "How long does it take to fly from New York to Los Angeles?"
Model: Calls get_flight_duration("JFK", "LAX", true)
Function returns: "6 hours 30 minutes including one layover"
How the model matches functions to prompts:
Models use the function descriptions and parameter details to understand when a function is relevant. They look for keywords, context clues, and the type of information being requested. The better your function descriptions, the more accurately the model will know when and how to use them.
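In code, a function definition like the one above is passed to the model as a JSON schema. A hedged sketch using the OpenAI Python client (the model name is a placeholder, and get_flight_duration is a hypothetical function you would implement and execute yourself):

```python
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_flight_duration",
        "description": "Calculate flight duration between two airports",
        "parameters": {
            "type": "object",
            "properties": {
                "departure_airport": {"type": "string", "description": "IATA airport code, e.g. JFK"},
                "arrival_airport": {"type": "string", "description": "IATA airport code, e.g. LAX"},
                "include_layovers": {"type": "boolean", "description": "Whether to include connection time"},
            },
            "required": ["departure_airport", "arrival_airport"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "How long does it take to fly from New York to Los Angeles?"}],
    tools=tools,
)

# If the model decided the function is relevant, it returns the name and arguments to call it with.
tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name)                   # "get_flight_duration"
print(json.loads(tool_call.function.arguments))  # e.g. {"departure_airport": "JFK", "arrival_airport": "LAX", ...}
```

You then run your own implementation with those arguments and send the result back to the model as a tool message so it can compose the final answer.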
Benefits:
- Access to real-time information
- Ability to perform precise calculations
- Integration with external systems and databases
- More accurate and up-to-date responses
More information:
- Connecting to a Local MCP Server Using Microsoft.Extensions.AI
- Model Context Protocol Development Best Practices
- Building AI Agents with Ease: Function Calling in VS Code AI Toolkit
- Unlocking GPT-5’s Freeform Tool Calling in Azure AI Foundry
- General Availability of the Responses API in Azure AI Foundry
- Let’s Learn Model Context Protocol with JavaScript and TypeScript
Model Context Protocol (MCP)
What is MCP and what problem does it solve? Model Context Protocol is an open standard that enables AI models to securely connect to external data sources and tools. Before MCP, each AI application had to build custom integrations for every service they wanted to connect to. MCP creates a standardized way for AI models to access external resources, making it easier to build AI applications that can interact with real-world systems.
Key components:
- Host: The application that contains the AI model (like your IDE, chat application, or development environment)
- Client: The component that communicates with MCP servers on behalf of the AI model
- Server: The service that provides access to external resources like databases, APIs, or file systems
graph TB
Host["Host Application<br/>(VS Code, Claude Desktop,<br/>Custom App)"]
Client["MCP Client<br/>(Protocol Handler)"]
Server1["MCP Server:<br/>Database Access"]
Server2["MCP Server:<br/>File System"]
Server3["MCP Server:<br/>External APIs"]
DB[("Database")]
FS[("Files")]
API[("APIs")]
Host --> Client
Client --> Server1
Client --> Server2
Client --> Server3
Server1 --> DB
Server2 --> FS
Server3 --> API
style Host fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style Client fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style Server1 fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style Server2 fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style Server3 fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style DB fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style FS fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style API fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
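To make the host/client/server roles concrete, here is a minimal server sketch using the FastMCP helper from the official Python SDK (the tool is a toy; a real server would query an actual system):

```python
# pip install "mcp[cli]"  -- the official Model Context Protocol Python SDK
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()
def lookup_order(order_id: str) -> str:
    """Return the status of an order (toy implementation standing in for a database query)."""
    return f"Order {order_id}: shipped"

if __name__ == "__main__":
    # Runs over stdio by default, so a local host (IDE, desktop client) can launch and talk to it
    mcp.run()
```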
How does it relate to OpenAI function calling? MCP and OpenAI function calling serve similar purposes but work at different levels:
- Function calling is a feature within specific AI models that allows them to call predefined functions
- MCP is a protocol that standardizes how AI applications connect to external services, which can then expose functions to the AI
Think of function calling as the language AI models use to request external actions, while MCP is the standardized postal service that delivers those requests to the right destinations.
Security considerations: MCP is a protocol, not a deployment model. The security properties you get depend on the transport you use (for example, local stdio vs HTTP) and how you deploy the server.
- Authorization is optional in MCP. Some servers expose tools without any built-in auth, while others can be deployed behind an identity-aware gateway.
- For HTTP-based transports, the MCP specification describes an OAuth 2.1-based authorization approach (see the authorization specification). Support varies by server and client.
- For local stdio servers, the HTTP authorization spec does not apply; credentials typically come from the local environment or configuration rather than interactive OAuth flows.
If you want to use MCP in production, focus on controls that are independent of any single server implementation:
- Put MCP servers behind authentication and authorization you control (gateway, reverse proxy, or platform-native identity)
- Apply least privilege to tool scopes and downstream API permissions
- Isolate servers (and their credentials) per environment and, when needed, per tenant/user
- Log and audit tool invocations, and treat tool outputs as untrusted input
Risks to consider:
- MCP servers can access external systems, so proper security and access controls are essential
- Always validate and sanitize data from external sources
- Consider the privacy implications of connecting AI models to sensitive data sources
Learning resources:
- MCP course on Hugging Face provides comprehensive training
- Microsoft is working on enhanced MCP support with better security features
More information:
- Connecting to a Local MCP Server Using Microsoft.Extensions.AI
- Model Context Protocol Development Best Practices
- Let’s Learn Model Context Protocol with JavaScript and TypeScript
- Building AI Agents with Semantic Kernel, MCP Servers, and Python
- Agent Factory: Building Your First AI Agent with Azure AI Foundry
- Zero Trust Agents: Adding Identity and Access to Multi-Agent Workflows
Retrieval Augmented Generation (RAG)
What is RAG and why is it important? Retrieval Augmented Generation combines the power of AI language models with access to specific, up-to-date information from external sources. Instead of relying solely on the AI’s training data (which has a cut-off date), RAG allows the model to retrieve relevant information from documents, databases, or knowledge bases in real-time and use that information to generate more accurate responses.
How RAG works:
- Your question is processed to understand what information is needed
- A search system finds relevant documents or data from your knowledge base
- The retrieved information is combined with your original question
- The AI model generates a response based on both your question and the retrieved information
graph TB
User["User Query:<br/>'What is our refund policy?'"]
Embed["Convert to<br/>Vector Embedding"]
Search["Vector Search<br/>in Knowledge Base"]
KB[("Knowledge Base<br/>Docs, Policies,<br/>FAQs")]
Results["Retrieve Relevant<br/>Documents"]
Combine["Combine Query +<br/>Retrieved Context"]
LLM["LLM Generates<br/>Response"]
Response["'Our refund policy allows...<br/>Source: Policy Doc v2.3'"]
User --> Embed
Embed --> Search
Search --> KB
KB --> Results
Results --> Combine
User --> Combine
Combine --> LLM
LLM --> Response
style User fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style Embed fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style Search fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style KB fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style Results fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style Combine fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style LLM fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style Response fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
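The flow in the diagram can be sketched in a few lines of Python. The `embed` and `generate` callables are placeholders for your embedding and chat models, and the in-memory list stands in for a real vector database:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def answer_with_rag(question, knowledge_base, embed, generate, top_k=3):
    """knowledge_base: list of dicts with 'text', 'source', and precomputed 'embedding' keys."""
    # 1. Retrieve: rank documents by semantic similarity to the question
    query_embedding = embed(question)
    ranked = sorted(knowledge_base, key=lambda d: cosine(query_embedding, d["embedding"]), reverse=True)
    context = ranked[:top_k]

    # 2. Augment: combine the question with the retrieved passages and their sources
    context_block = "\n\n".join(f"[{d['source']}]\n{d['text']}" for d in context)
    prompt = (
        "Answer the question using only the context below, and cite the sources you used.\n\n"
        f"Context:\n{context_block}\n\nQuestion: {question}"
    )

    # 3. Generate: the model grounds its answer in the retrieved context
    return generate(prompt)
```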
Why RAG is valuable:
- Provides access to current information beyond the model’s training cut-off
- Allows AI to work with your specific company data and documents
- Reduces hallucinations by grounding responses in factual sources
- Enables AI to cite sources and provide verifiable information
How does it differ from MCP and function calling?
RAG is primarily about retrieving and using information from documents and knowledge bases. It’s focused on finding relevant text or data to inform the AI’s response.
MCP provides a standardized protocol for AI models to connect to various external services and tools, which could include RAG systems but also databases, APIs, and other services.
Function calling is the mechanism AI models use to invoke specific operations, which could include RAG searches, MCP server interactions, or direct API calls.
When to use each approach:
Use RAG when:
- You need AI to answer questions about specific documents or knowledge bases
- You want responses grounded in verifiable sources
- You’re dealing with information that changes frequently
- You need to work with proprietary or domain-specific content
Use MCP when:
- You need standardized connections to multiple external services
- You want to build reusable integrations across different AI applications
- You need secure, protocol-based access to external resources
Use function calling when:
- You need the AI to perform specific actions (calculations, API calls, data operations)
- You want direct control over what external services the AI can access
- You’re building custom integrations for specific use cases
More information:
- Retrieval-Augmented Generation (RAG) in Azure AI: A Step-by-Step Guide
- Evaluating GPT-5 Models for RAG on Azure AI Foundry
Agents and agentic AI
What makes something an agent? An AI agent is a system that can autonomously perform tasks, make decisions, and interact with external environments to achieve specific goals. Unlike simple AI models that respond to individual prompts, agents can:
- Plan multi-step tasks
- Use tools and external services
- Learn from feedback and adapt their approach
- Operate with some degree of independence
- Maintain context across multiple interactions
Is there a formal definition or interface? While there’s no single universal definition, most AI agents share common characteristics:
- Autonomy: Can operate without constant human intervention
- Goal-oriented: Work toward specific objectives
- Environment interaction: Can perceive and act upon their environment
- Tool use: Can access and utilize external resources
- Planning: Can break down complex tasks into manageable steps
What’s the difference compared to MCP servers? MCP servers provide specific services and tools that AI models can access through a standardized protocol. They’re typically focused on particular functions (like database access or file management).
AI agents use tools and services (potentially including MCP servers) to accomplish broader goals. An agent might use multiple MCP servers, APIs, and other resources to complete complex, multi-step tasks.
Think of MCP servers as specialized tools in a workshop, while AI agents are the skilled craftspeople who use those tools to complete projects.
What does “agentic” mean? “Agentic” describes AI systems that exhibit agent-like behaviors: acting independently, making decisions, and pursuing goals with minimal human oversight. Agentic AI can:
- Take initiative to solve problems
- Adapt strategies based on results
- Handle unexpected situations
- Work toward long-term objectives
- Coordinate with other systems or agents
Examples of agentic AI:
- Personal assistants that can book appointments, send emails, and manage schedules
- Code assistants that can analyze codebases, identify issues, and implement fixes
- Research agents that can gather information from multiple sources and synthesize findings
- Customer service agents that can resolve issues across multiple systems and departments
More information:
- Introducing Microsoft Discovery: An Agentic AI Platform for Scientific Research
- Designing and Creating Agentic AI Systems on Azure
- Agent Factory: Enterprise Patterns and Best Practices for Agentic AI with Azure AI Foundry Agent Service
- Building a multi-agent system with Semantic Kernel
- Build Biosensing AI-Native Apps on Azure with BCI, AI Foundry, and Agents Service
- Unlocking Innovation with Azure AI Foundry Agent Service
Multi-agent solutions
Multi-agent systems treat your application as a team: each agent brings a specific skill, and together they pursue a shared goal. This shift from a single “do-everything” assistant to a collaborating group pays off when you want clearer responsibilities, predictable behavior, and outputs you can verify. As systems grow, that separation of concerns is what keeps them understandable and operable.
Core principles
Effective multi-agent systems start by breaking work into bounded subtasks with crisp objectives. Those subtasks are then assigned to specialized agents: one excels at retrieval, another at planning, a third at coding, and a fourth at review. An orchestrator (or router) selects which agent should act next and, where possible, makes that choice deterministically so runs are reproducible.
graph TB
User["User Request:<br/>'Build a feature with tests'"]
Orch["Orchestrator<br/>(Routes tasks to agents)"]
Plan["Planning Agent<br/>(Breaks down task)"]
Code["Coding Agent<br/>(Implements solution)"]
Review["Review Agent<br/>(Checks quality)"]
Test["Testing Agent<br/>(Writes tests)"]
Result["Complete Solution<br/>with Tests"]
User --> Orch
Orch --> Plan
Plan --> Orch
Orch --> Code
Code --> Orch
Orch --> Test
Test --> Orch
Orch --> Review
Review --> Orch
Orch --> Result
style User fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style Orch fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style Plan fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style Code fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style Review fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style Test fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
style Result fill:#2d2d4a,color:#e0e0e0,stroke:#64b5f6
- Information flows as compact artifacts (file identifiers, summaries, and links), so context stays short and handoffs remain explicit.
- Guardrails (least-privilege identities, policy checks, and explicit stop conditions) keep loops in check and scope contained.
- Evaluation closes the feedback loop: define success criteria, measure outcomes, and feed results back into the process.
- Across the system, observability and provenance matter: log handoffs, tool calls, and sources. Keep cost and latency in check by parallelizing independent work and capping tokens and turns.
Coordination models
Coordination models describe how control and data move between agents. There isn’t a single model. In practice, you combine three choices:
- Control pattern: who decides and in what order:
- Orchestrator–worker (also called planner/router): a single coordinator chooses the next agent and enforces sequence.
- Decentralized/peer: agents trigger or negotiate with each other without a central coordinator.
- Execution topology: how the work is scheduled:
- Serial/pipeline: dependent steps run one after another.
- Parallel fan-out/fan-in: independent subtasks run concurrently and merge when all are done.
- State sharing: how agents exchange context:
- Shared memory/blackboard: agents post and read structured artifacts (IDs, summaries, links) from a common store.
- Direct messages: agents hand artifacts to specific peers.
How these relate:
- Orchestrator–worker is about control. You can still run fan-out/fan-in or a serial pipeline under an orchestrator. The orchestrator decides who acts when.
- Fan-out/fan-in is about topology. It pairs with either centralized (orchestrator) or decentralized control.
- Shared memory is about state. It works with both approaches to persist intermediate artifacts without over-sharing raw context.
These building blocks scale from small workflows to complex pipelines without changing the mental model.
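A compact sketch combining the three choices: orchestrator-worker control, parallel fan-out/fan-in execution, and a shared dictionary acting as a minimal blackboard. The planner, worker, and reviewer callables are placeholders for your agents:

```python
import asyncio

async def orchestrate(task: str, planner, workers: dict, reviewer):
    """Centralized control: the orchestrator decides who acts, when, and with what artifacts."""
    blackboard: dict[str, str] = {"task": task}  # shared memory: compact artifacts, not raw transcripts

    # The planner breaks the task into named subtasks, each routed to a specialized worker
    subtasks = await planner(task)  # e.g. {"code": "implement the parser", "tests": "write unit tests"}

    # Fan-out: independent subtasks run concurrently; fan-in: wait for all of them to finish
    results = await asyncio.gather(
        *(workers[name](description, blackboard) for name, description in subtasks.items())
    )
    for name, result in zip(subtasks, results):
        blackboard[name] = result

    # A reviewer agent checks the merged artifacts before anything leaves the system
    return await reviewer(blackboard)
```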
MCP and A2A in the architecture
Two protocols help anchor the architecture.
MCP (Model Context Protocol) standardizes how agents access tools and data: servers expose capabilities, and hosts route requests with consistent security and observability. It avoids one-off integrations and keeps tool use uniform across agents.
A2A (agent-to-agent) covers how agents talk to each other: structured messages and artifacts for planning, handoffs, and reconciliation. When multiple agents must coordinate, A2A turns ad-hoc prompt passing into a predictable contract.
ACP is an emerging specification that aims to standardize A2A message formats and interaction patterns.
Together: MCP connects agents to the outside world. A2A connects agents to each other. MCP keeps tool access consistent, and A2A keeps collaboration predictable.
When to adopt multi-agent designs
Choose multi-agent designs when splitting work across distinct competencies is clearer and safer than one large prompt, such as retrieval versus code generation, or when you need separation of duties like a policy checker or reviewer. They also shine when you can exploit fan-out/fan-in across independent subtasks to shorten wall-clock time, or when stronger assurance and isolation matter, for example by running different agents under least-privilege identities.
Design guidance
Make handoffs explicit: define schemas that capture the goal, inputs, constraints, evidence, and success criteria. Pass artifacts by reference (file IDs or links) and keep messages minimal to control context growth.
Bound execution with token/turn caps and clear exit conditions.
Capture the trail: log every handoff and tool call, including sources, for traceability. Finally, build an evaluation harness that exercises end-to-end scenarios so you can quantify quality, prevent regressions, and iterate safely.
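One way to make the handoffs above explicit is a small typed artifact that every agent produces and consumes; the field names below are illustrative, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class Handoff:
    """Structured artifact passed between agents instead of raw conversation history."""
    goal: str                                                   # what the receiving agent must achieve
    inputs: list[str] = field(default_factory=list)             # references (file IDs, links), not full content
    constraints: list[str] = field(default_factory=list)        # e.g. "no changes to production code"
    evidence: list[str] = field(default_factory=list)           # sources supporting the work so far
    success_criteria: list[str] = field(default_factory=list)   # how the result will be judged

handoff = Handoff(
    goal="Write unit tests for the parser module",
    inputs=["file:src/parser.py"],
    constraints=["pytest only", "do not modify production code"],
    success_criteria=["all new tests pass", "line coverage of parser.py at or above 80%"],
)
```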
Scaling AI implementations
Scaled GenAI refers to deploying generative AI solutions across entire organizations or large user bases. This requires considerations around infrastructure, cost management, quality control, security, and governance. Companies implementing scaled GenAI need to think about how to maintain consistency, manage costs, and ensure responsible use across thousands of users and use cases.
Key considerations for scaling AI:
- Infrastructure planning: Ensuring adequate computational resources and network capacity
- Cost management: Monitoring and optimizing AI usage costs across the organization
- Quality control: Maintaining consistent AI outputs and performance standards
- Security and compliance: Protecting sensitive data and meeting regulatory requirements
- Governance frameworks: Establishing policies for appropriate AI use and oversight
- Change management: Training users and managing the transition to AI-enhanced workflows
AI Center of Excellence (CoE)
An AI CoE is a cross-functional hub that accelerates safe, consistent, and cost-effective AI adoption at scale by centralizing strategy, governance, platforms, and skills.
What it does:
- Strategic guidance: enterprise AI vision, roadmaps, business case/ROI models
- Governance and standards: responsible AI policy, risk and compliance controls, audit processes
- Technical enablement: shared AI platforms, reference architectures, MLOps, tooling
- Knowledge sharing: best practices, communities of practice, reuse catalogs
- Talent development: training paths, certification, mentorship
Lean structure (typical core roles):
- Director (strategy and executive alignment)
- Technical lead (architecture and platform)
- Business liaison (intake, value, adoption)
- Ethics/compliance officer (responsible AI, legal)
- Program manager (portfolio and delivery)
Operating model (lightweight but enforced):
- Intake and prioritization: clear request template and value/risk scoring
- Standard lifecycle: quality gates for data, evals, security, and responsible-AI checks
- Support and operations: monitoring, incident handling, cost/perf optimization
Phased rollout (fastest path to impact):
- Phase 1: Foundation (3 months) — team, inventory, initial policy, comms
- Phase 2: Pilots (3–6 months) — 2–3 business-value pilots on the shared platform
- Phase 3: Scale (6–9 months) — replicate patterns, expand governance and literacy
Measure what matters (sample KPIs):
- Time to production (target 3–6 months), component reuse rate (≥60%)
- Model quality/compliance (≥90% production-ready, incident reduction)
- Business impact (ROI uplift, adoption rates), reliability (uptime)
Tip: Pair the CoE with centralized platforms (for consistency and cost control) plus sandbox spaces (to keep innovation fast), and apply least-privilege access throughout.
See: Building a Center of Excellence for AI: A Strategic Roadmap for Enterprise Adoption.
The AI-native web: NLWeb, llms.txt, and semantic search
AI is changing how we navigate websites and data. Instead of clicking through menus and forms, we’ll increasingly describe what we want in natural language. Sites and apps will respond by resolving intent, pulling the right data, and assembling answers with sources. Three related ideas are emerging that make this possible:
Semantic search (and why it matters)
Traditional search matches exact words. Semantic search matches meaning using embeddings (numeric representations of text, images, or other data). This lets users ask questions in their own words and still find the right content. In practice, semantic search powers Retrieval-Augmented Generation (RAG), site search that understands synonyms and context, and cross-type discovery (e.g., “the video that explains streaming tokens”).
NLWeb (natural-language web)
NLWeb refers to patterns that make the web conversational by default. Pages expose capabilities (search, lookup, actions) as structured affordances that AI agents can call. Content is organized as artifacts with clear identifiers and metadata. Users ask for outcomes (“Find the latest pricing and compare to last quarter”), and the site resolves the request through tools and data rather than forcing step-by-step navigation.
What changes:
- Interfaces become intent-first rather than page-first
- Sites describe actions and data in machine-readable ways so agents can help
- Results include sources, links, and artifacts you can reuse
Some projects describe this as an “agent-native” layer for the web, similar to how HTML+HTTP enabled browsers. If you want a concrete example, the NLWeb project itself frames the idea in relation to MCP (and mentions A2A as an emerging direction).
Implementation details (one example, not a standard): NLWeb is an open-source project that aims to simplify building conversational interfaces for websites. It describes using semi-structured formats (like Schema.org and RSS) as inputs, indexing content into a vector store for semantic retrieval, and exposing capabilities via MCP so AI clients can call tools against the site.
llms.txt
Like robots.txt for crawlers, llms.txt is a proposed convention for publishing an LLM-friendly index of a site. The idea is to put a markdown file at a predictable path (typically /llms.txt) that points to the most useful pages and documents, with a short summary and an optional section for “nice to have” links.
- Spec and guidance: llms.txt
- Example: GoFastMCP llms.txt
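A minimal, hypothetical llms.txt following the proposal’s structure (a title, a short summary, then sections of annotated links) might look like this:

```markdown
# Example Docs

> Documentation for the Example product: setup guides, API reference, and pricing.

## Docs
- [Quickstart](https://example.com/docs/quickstart): install and run in five minutes
- [API reference](https://example.com/docs/api): endpoints, parameters, and error codes

## Optional
- [Changelog](https://example.com/changelog): release history
```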
The bottom line: AI turns websites and data stores into conversational surfaces. By adding llms.txt and shipping semantic search (or at least clean, machine-readable structure plus stable URLs), you make your content easier for both people and agents to discover, cite, and reuse.