AI Coding Tools Glossary
An authoritative reference glossary of terms for AI coding tools.
A
Agent Mode
A workflow where an AI assistant autonomously plans and executes multi-step tasks—reading files, running tests, making edits—with minimal human intervention. Agent mode goes beyond single completions to handle complex refactors.
Agentic Coding
An AI paradigm where the tool autonomously plans, executes, and iterates on coding tasks — reading files, running commands, fixing errors, and testing results in a loop. Examples: Claude Code, Cursor Composer agent mode. Represents the evolution from suggestion-based to autonomous AI assistance.
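The plan-act-observe loop can be sketched in miniature. Everything below is illustrative: the "model" is a hard-coded stub and the test/fix environment is simulated, standing in for real LLM calls and shell commands.

```python
def model_decide(observation):
    """Stub standing in for an LLM call: choose the next action from the last observation."""
    if observation == "start":
        return "run_tests"
    if "FAILED" in observation:
        return "fix_code"
    if observation == "patch applied":
        return "run_tests"  # re-run tests after an edit
    return "done"

def run_agent(max_steps=10):
    """Loop: decide an action, execute it, feed the result back, until done."""
    observation, log, attempts = "start", [], 0
    for _ in range(max_steps):
        action = model_decide(observation)
        log.append(action)
        if action == "done":
            break
        if action == "run_tests":
            attempts += 1  # simulated: the first run fails, the re-run passes
            observation = "FAILED: 1 test" if attempts == 1 else "all tests passed"
        elif action == "fix_code":
            observation = "patch applied"  # simulated edit
    return log
```

Running this produces the characteristic agentic trace: test, observe a failure, patch, re-test, stop.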
AI Code Review
The use of an AI assistant to automatically analyze pull requests or diffs for bugs, style violations, security issues, and logic errors. AI code review accelerates feedback cycles and catches issues before human reviewers.
AI Pair Programming
A development style where an AI assistant acts as the second programmer in a pair, offering real-time suggestions, explanations, and corrections. AI pair programming increases developer velocity and reduces context-switching.
AI PR Review
An automated process where an AI assistant analyzes a pull request diff and posts comments about bugs, style, security, and test coverage. AI PR review complements human reviewers and speeds up code quality feedback loops.
AI Test Generation
Using an AI model to automatically write unit, integration, or end-to-end tests for existing code. AI test generation increases coverage and reduces the manual effort required to write thorough test suites.
Air-Gapped Deployment
An AI installation with no network connectivity to external services, used in high-security environments. Air-gapped deployments require pre-downloaded model weights and prevent any data from leaving the facility.
API Key Management
The practices for securely storing, rotating, and scoping API keys used to authenticate with AI model providers. Poor API key management is a leading cause of unexpected billing charges and data exposure in AI-powered apps.
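A minimal sketch of two basics: reading the key from the environment rather than source code, and redacting it in logs. The variable name MODEL_API_KEY is illustrative; use your provider's documented name.

```python
import os

def load_api_key(env_var="MODEL_API_KEY"):
    """Read the key from the environment instead of hard-coding it in source.
    The variable name here is illustrative, not a real provider convention."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it or use a secrets manager")
    return key

def redact(key):
    """Log only the last four characters, never the full key."""
    return "*" * max(0, len(key) - 4) + key[-4:]
```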
Attention Mechanism
The core operation in transformer models that lets each token attend to all other tokens in the context. Attention enables an AI coding tool to relate a variable declaration on line 1 to its usage on line 200.
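In pure Python, scaled dot-product attention over per-token query, key, and value vectors looks roughly like this (a sketch of the math, not an optimized implementation):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: each query token mixes the value vectors,
    weighted by how strongly it matches each key."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)  # how much this token attends to every other token
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

When a query strongly matches one key, its output is essentially that key's value vector, which is how a usage site can "look up" a distant declaration.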
B
Beam Search
A decoding algorithm that maintains multiple candidate sequences simultaneously and selects the highest-probability complete output. Beam search is used in batch code generation tasks where quality matters more than speed.
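A toy beam search over a hand-made next-token table; every token and probability below is invented for illustration.

```python
import math

# Toy next-token distributions keyed by the previous token; "</s>" ends a sequence.
NEXT = {
    "<s>": {"def": 0.6, "class": 0.4},
    "def": {"foo": 0.7, "bar": 0.3},
    "class": {"Foo": 0.9, "Bar": 0.1},
    "foo": {"(": 1.0}, "bar": {"(": 1.0},
    "Foo": {":": 1.0}, "Bar": {":": 1.0},
    "(": {"</s>": 1.0}, ":": {"</s>": 1.0},
}

def beam_search(beam_width=2, max_len=8):
    """Keep the `beam_width` highest log-probability sequences at each step."""
    beams = [(0.0, ["<s>"])]
    for _ in range(max_len):
        candidates = []
        for logp, toks in beams:
            if toks[-1] == "</s>":
                candidates.append((logp, toks))  # finished: carry over unchanged
                continue
            for tok, p in NEXT[toks[-1]].items():
                candidates.append((logp + math.log(p), toks + [tok]))
        beams = sorted(candidates, reverse=True)[:beam_width]
        if all(toks[-1] == "</s>" for _, toks in beams):
            break
    return max(beams)[1]  # most probable complete sequence
```

Note that greedy decoding and beam search agree here; beam search earns its cost when an early low-probability token leads to a higher-probability completion overall.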
BPE (Byte Pair Encoding)
A tokenization algorithm that iteratively merges the most frequent character pairs into single tokens. BPE is widely used in code models because it efficiently represents programming keywords, operators, and identifiers.
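A minimal sketch of the BPE training loop, fusing the most frequent adjacent token pair on each iteration (real tokenizers add byte-level fallbacks and much larger corpora):

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Learn merge rules: repeatedly fuse the most frequent adjacent token pair."""
    vocab = [list(w) for w in words]        # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in vocab:
            for pair in zip(toks, toks[1:]):
                pairs[pair] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)    # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        for i, toks in enumerate(vocab):    # apply the merge everywhere
            out, j = [], 0
            while j < len(toks):
                if j + 1 < len(toks) and (toks[j], toks[j + 1]) == best:
                    out.append(merged)
                    j += 2
                else:
                    out.append(toks[j])
                    j += 1
            vocab[i] = out
    return merges, vocab
```

On ["lower", "lowest", "low"], the first two merges fuse "l"+"o" and then "lo"+"w", so the common stem "low" quickly becomes a single token.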
C
Chat Mode
An interaction style where developers converse with an AI assistant in a back-and-forth dialogue to ask questions, debug code, or design solutions. Chat mode complements inline completion by handling open-ended queries.
CI/CD AI Integration
Incorporating AI-powered checks—such as automated code review, test generation, or security scanning—into a continuous integration and delivery pipeline. CI/CD AI integration catches issues automatically on every pull request.
Code Completion
An AI feature that predicts and suggests the next lines of code as you type. Modern tools use large language models to suggest entire functions, not just variable names. Accuracy depends on context quality — more open files and comments improve suggestions.
Code LLM
A large language model specifically trained or fine-tuned for programming tasks. Examples: CodeLlama, DeepSeek Coder, StarCoder, Codex. These models understand syntax, APIs, and programming patterns. They power the AI features in code editors and copilot tools.
Codebase Indexing
The process of scanning, parsing, and creating searchable representations of your entire project. Enables AI to answer questions about code it hasn't directly seen. Tools index file structure, function signatures, imports, and semantic content.
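A tiny slice of indexing, extracting function signatures from one file's source with Python's ast module; a real indexer walks the whole project and also builds semantic embeddings.

```python
import ast

def index_source(path, source):
    """Map each function name to its file, line number, and parameters."""
    index = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            index[node.name] = {
                "file": path,
                "line": node.lineno,
                "args": [a.arg for a in node.args.args],
            }
    return index
```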
Context Length (Local Models)
The maximum number of tokens a locally run model can process in one request, which is often smaller than hosted models. Longer context lengths require proportionally more RAM and slow inference significantly.
Context Length Limit
The hard upper bound on how many tokens can be in a single model request, including the prompt and the generated output. Exceeding the limit requires truncation or summarization strategies to preserve key context.
Context Window
The maximum amount of text (measured in tokens) an AI model can process in a single request. Larger windows allow the AI to see more of your codebase. GPT-4o: 128K tokens. Claude: 200K tokens. Critical for understanding cross-file dependencies and large codebases.
Copilot
A general term (popularized by GitHub Copilot) for an AI assistant embedded in an IDE that suggests code, explains errors, and answers questions. Multiple vendors now offer copilot-style tools with varying model backends.
CPU Inference
Running LLM inference on a CPU rather than a GPU, typically 5–20x slower but accessible without specialized hardware. Tools like llama.cpp make CPU inference practical for smaller models on everyday laptops.
F
Fine-Tuning
Training an existing AI model on specialized data to improve performance for specific tasks. Code-specific models (CodeLlama, StarCoder) are fine-tuned on programming data. Custom fine-tuning on your codebase is emerging but currently expensive and complex.
Foundation Model
A large-scale AI model pre-trained on broad data that serves as a base for many downstream tasks. Foundation models are adapted for code generation, chat, and tool use with minimal additional training.
Function Calling
A capability that lets a model invoke predefined functions or tools (e.g., run a shell command, query a database) and incorporate their results into its response. Function calling powers agentic coding workflows.
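The client side can be sketched as a tool registry plus a dispatcher. The JSON message shape and tool names below are illustrative, not any specific provider's API.

```python
import json

# Hypothetical tools; a real assistant advertises these to the model as schemas.
TOOLS = {
    "run_tests": lambda path: f"ran tests in {path}: 3 passed",
    "read_file": lambda path: f"<contents of {path}>",
}

def dispatch(tool_call_json):
    """Execute the function the model asked for and return its result,
    which would be appended to the conversation for the model's next turn."""
    call = json.loads(tool_call_json)
    return TOOLS[call["name"]](**call["arguments"])

# Simulated model output: a structured request to call one of the tools.
model_output = json.dumps({"name": "run_tests", "arguments": {"path": "tests/"}})
```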
G
GGUF Format
A binary file format for storing quantized LLM weights, used by llama.cpp and compatible runtimes. GGUF replaced the older GGML format and is the standard for sharing locally runnable open-weight models.
Ghost Text
The dimmed, preview text that appears in the editor showing an AI-generated code suggestion before you accept it. Pressing Tab inserts the suggestion; pressing Escape dismisses it. The term comes from VS Code's rendering API used by Copilot and similar extensions.
GPU VRAM Requirements
The amount of video RAM needed to load a model's weights into a GPU for fast inference. As a rule of thumb, an unquantized model needs roughly 2 bytes of VRAM per parameter, so a 7B model requires ~14 GB at FP16.
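The rule of thumb translates directly into a rough estimator; the 20% overhead factor for activations and KV cache is a ballpark assumption, not a fixed rule.

```python
def vram_gb(params_billion, bytes_per_param=2.0, overhead=1.2):
    """Rough VRAM estimate: weight memory times an overhead factor."""
    return params_billion * 1e9 * bytes_per_param * overhead / 1e9

# Weights alone for a 7B model at FP16 (2 bytes/param): ~14 GB.
fp16_7b = vram_gb(7, overhead=1.0)
# The same model quantized to INT4 (~0.5 bytes/param): ~3.5 GB.
int4_7b = vram_gb(7, bytes_per_param=0.5, overhead=1.0)
```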
Grounding
The practice of anchoring model responses to specific, verified sources such as codebase files, documentation, or test results. Grounding reduces hallucinations by giving the model authoritative context to cite.
Guardrails
Safety and quality constraints applied to AI outputs to prevent harmful, insecure, or off-topic responses. Coding tools use guardrails to block generation of malware, credential leaks, or license-incompatible code.
H
Hallucination
When an AI model generates plausible-looking but incorrect code — referencing APIs that don't exist, inventing function signatures, or producing logic that compiles but doesn't work correctly. More common with obscure libraries. Always verify AI suggestions against official documentation.
Hardware Requirements for LLMs
The CPU, RAM, GPU, and storage specs needed to run a local LLM at acceptable speed. Requirements scale with model size; a 7B model runs on most modern laptops, while a 70B model typically needs a high-end GPU or multi-GPU server.
HumanEval Benchmark
A dataset of 164 hand-written Python programming problems used to evaluate code generation models. HumanEval measures functional correctness by running generated code against unit tests.
I
IDE Plugin / Extension
A software add-on that embeds an AI coding assistant directly into a developer's editor (VS Code, JetBrains, Neovim, etc.). IDE plugins surface inline completions, chat panels, and code actions without leaving the coding environment.
Inference
The process of running a trained model to generate predictions or completions on new input. Inference speed—measured in tokens per second—directly affects how responsive an AI coding tool feels during use.
Inline Suggestion
Code completion that appears as ghost text directly in the editor at the cursor position. Accepted with Tab, dismissed with Escape. The primary interaction model for tools like GitHub Copilot. Quality varies from single tokens to multi-line blocks.
INT4 Quantization
A quantization level that represents model weights as 4-bit integers, dramatically reducing VRAM and RAM usage. INT4 roughly quarters memory use compared with FP16, letting much larger models fit on a single GPU with modest quality trade-offs.
INT8 Quantization
A quantization level that represents weights as 8-bit integers, balancing memory savings with output quality. INT8 is often used when INT4 degrades accuracy too much for a given coding task.
L
Large Language Model
A deep learning model trained on vast text corpora to understand and generate human language. LLMs like GPT-4 and Claude power AI coding tools by predicting the most useful next token given a prompt and context.
Latency
The delay between submitting a prompt and receiving the first token of the model's response. Low latency is critical for inline code completion, where delays longer than ~100 ms disrupt developer flow.
llama.cpp
A C/C++ inference engine for running LLMs efficiently on commodity hardware without a GPU. It powers many local AI tools and supports GGUF models with CPU and GPU offloading options.
Local LLM
A large language model that runs entirely on your own machine rather than a remote API. Local LLMs offer offline access, no network latency or per-token API costs, and complete data privacy for AI-assisted coding.
LSP Integration
Connecting an AI coding tool to the Language Server Protocol so it can access real-time diagnostics, symbol information, and go-to-definition data from the IDE. LSP integration makes AI suggestions more context-aware and accurate.
M
MCP (Model Context Protocol)
An open standard that defines how AI models communicate with external tools, data sources, and services. MCP enables coding assistants to read files, run terminals, and query databases through a unified protocol.
Model Versioning
The practice of pinning integrations to a specific model version (e.g., gpt-4o-2024-05-13) to avoid unexpected behavior changes when providers release updates. Model versioning is essential for reproducible CI/CD pipelines.
Multi-File Editing
AI-assisted code changes that span multiple files simultaneously — renaming across a codebase, refactoring shared interfaces, or implementing a feature that touches several modules. A key differentiator of full AI editors (Cursor, Windsurf) versus simple copilot plugins.
O
Ollama
An open-source tool that makes it easy to download and run large language models locally on macOS, Linux, or Windows. Ollama manages model downloads, serves a local API, and supports popular open-weight models like Llama and Mistral.
On-Premise AI Deployment
Running an AI model entirely within an organization's own infrastructure rather than sending requests to a cloud provider. On-premise deployment gives maximum control over data privacy and latency.
Open-Weights Model
An AI model whose trained parameters are publicly released, allowing anyone to download, run, and modify it. Open-weights models like Llama, Mistral, and DeepSeek Coder are popular choices for local AI coding setups.
P
Pass@K Benchmark
An evaluation metric that measures the probability a model generates at least one correct solution within K attempts for a coding problem. Pass@1 tests whether the first suggestion is correct; higher K values measure coverage.
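The unbiased estimator commonly used with HumanEval computes pass@k from n generated samples, c of which pass the tests:

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn without replacement
    from n generations (c of them correct) passes: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill all k draws: success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 200 samples of which 50 are correct, pass@1 is 0.25: the chance a single random sample solves the problem.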
PII in Prompts
Personally identifiable information accidentally included in prompts sent to AI coding tools, such as API keys, email addresses, or user data embedded in code. PII leakage can violate privacy regulations and expose users.
Pre-Training
The initial phase of training a model on a massive dataset to learn general language and code patterns. Pre-training is computationally expensive but produces a versatile base model ready for fine-tuning.
Privacy Mode
A setting in AI coding tools that disables telemetry, code snippet uploads, and usage logging. Privacy mode is typically required by organizations with strict data governance policies.
Prompt Engineering
The practice of crafting effective instructions for AI models. In coding contexts: being specific about language, framework, patterns, and constraints. Good prompts include example inputs/outputs and reference existing code patterns. A high-leverage skill for maximizing AI tool value.
Prompt Injection
An attack where malicious content in user-controlled data manipulates an AI model's instructions, causing unintended behavior. In coding tools, prompt injection can appear in source files or comments read by the assistant.
R
RAG (Retrieval-Augmented Generation)
A technique that supplements a model's response by first retrieving relevant documents or code from an external store, then generating an answer grounded in that retrieved context. AI code editors use RAG to pull relevant files from your codebase into the prompt, reducing hallucinations in large projects.
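A toy retrieve-then-generate pipeline; word overlap stands in for the embedding similarity a real system would use, but the flow is the same.

```python
def retrieve(query, documents, top_n=2):
    """Rank documents by word overlap with the query (embedding stand-in)."""
    q = set(query.lower().split())
    return sorted(documents,
                  key=lambda d: len(q & set(d.lower().split())),
                  reverse=True)[:top_n]

def build_prompt(query, documents):
    """Prepend the retrieved snippets so the answer is grounded in them."""
    context = "\n---\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```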
Rate Limiting
Restrictions imposed by AI providers on how many requests or tokens a user can consume per minute or day. Staying within rate limits typically requires caching, request queuing, or tier upgrades to keep production coding tools running smoothly.
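A client-side token bucket is one common way to stay under a provider's limit; this sketch is illustrative, not tied to any particular API.

```python
import time

class TokenBucket:
    """Refill tokens at a steady rate, spend them on each request."""
    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should queue the request or back off
```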
RLHF (Reinforcement Learning from Human Feedback)
A training technique where human raters score model outputs and those scores guide further training via reinforcement learning. RLHF aligns AI coding assistants with developer preferences and reduces harmful or incorrect suggestions.
S
Self-Hosted AI
Deploying an AI model on infrastructure you control—a personal server, VPS, or private cloud—rather than using a vendor's managed API. Self-hosted AI gives full control over costs, data, and model choice.
Streaming Tokens
A delivery mode where model output tokens are sent to the client incrementally as they are generated rather than all at once. Streaming makes AI coding tools feel faster and allows users to interrupt unhelpful responses early.
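A generator models the client's view of streaming: chunks arrive one at a time and can be rendered (or interrupted) immediately.

```python
def stream_tokens(text):
    """Yield a response piece by piece, as a streaming API delivers it."""
    for token in text.split(" "):
        yield token + " "

# A client renders each chunk as it arrives instead of waiting for the whole reply.
received = []
for chunk in stream_tokens("def add(a, b): return a + b"):
    received.append(chunk)
```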
SWE-bench
A benchmark of real GitHub issues from popular Python repositories used to evaluate AI agents on software engineering tasks. SWE-bench tests whether a model can understand a bug report and produce a correct patch.
System Prompt
Hidden instructions that configure how an AI coding tool behaves. Defines personality, capabilities, coding style preferences, and safety guidelines. AI code editors use system prompts to specialize the base model for software development tasks.
T
Tab Completion
The action of pressing Tab (or a configured key) to accept an AI-generated inline suggestion. Tab completion workflows let developers move quickly through boilerplate while staying in their editor flow.
Telemetry
Usage data automatically collected by AI tools, such as accepted suggestions, latency metrics, and error rates. Developers should review telemetry settings to understand what data is shared with the vendor.
Temperature
A sampling parameter that controls how random the model's output is. Lower values (0.0–0.3) produce deterministic, focused code completions; higher values encourage more creative or varied suggestions.
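Temperature is applied by dividing the logits before the softmax; a minimal sketch:

```python
import math

def apply_temperature(logits, temperature):
    """Divide logits by the temperature, then softmax. Low T sharpens the
    distribution toward the top token; high T flattens it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```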
Throughput
The total number of tokens a system can process per unit of time across all users or requests. High throughput matters for teams sharing a self-hosted AI service or enterprise deployments with many concurrent developers.
Token
The basic unit of text that AI models process. Roughly 1 token = 0.75 words or 4 characters in English. Code is less token-efficient than prose due to special characters and formatting. Pricing, context limits, and response times are all measured in tokens.
Token Budget
The planned allocation of tokens between system instructions, retrieved context, conversation history, and expected output within a model's context window. Managing the token budget prevents truncation and controls API costs.
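A simple budgeting strategy keeps the newest conversation messages that fit the remaining budget; the 4-characters-per-token estimate is a rough heuristic, not a real tokenizer.

```python
def fit_history(messages, budget_tokens, est=lambda s: max(1, len(s) // 4)):
    """Keep the newest messages whose estimated token cost fits the budget."""
    kept, used = [], 0
    for msg in reversed(messages):          # walk newest to oldest
        cost = est(msg)
        if used + cost > budget_tokens:
            break                           # older messages are dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))             # restore chronological order
```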
Token Limit
The maximum number of tokens a model can process in a single request, covering both input and output. Hitting the token limit truncates context, which can cause an AI assistant to lose track of earlier code.
Tokenization
The process of splitting raw text or code into discrete units (tokens) before feeding them to a model. How a tokenizer splits code affects costs, context limits, and how well the model handles identifiers and symbols.
Tokens per Second
A benchmark measuring how many tokens an AI model generates each second during inference. Higher tokens-per-second means a more responsive coding assistant; typical consumer GPU setups achieve 30–120 t/s for 7B models.
Tool Use
The ability of an AI model to call external tools—web search, code execution, file I/O—during inference. Tool use extends an assistant beyond text generation to take real actions inside a development environment.
Top-K Sampling
A decoding method that restricts token selection to the K most likely next tokens at each step. It is often used alongside top-p to reduce nonsensical outputs in code generation.
Top-P (Nucleus) Sampling
A decoding strategy that limits token selection to the smallest set whose cumulative probability exceeds a threshold p. Top-P sampling helps AI tools balance diversity and coherence in generated code.
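A sketch of the nucleus filter: sort tokens by probability, keep them until the cumulative mass reaches p, then renormalize before sampling.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches p, renormalized so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        total += prob
        if total >= p:
            break  # the nucleus is complete
    return {token: prob / total for token, prob in kept}
```

With p = 0.8 over {a: 0.5, b: 0.3, c: 0.15, d: 0.05}, only a and b survive, so low-probability noise tokens can never be sampled.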
Training Data
The corpus of text and code used to teach a model its capabilities. The composition of training data heavily influences which languages, frameworks, and patterns an AI coding tool handles best.
Training Data Opt-Out
A provider option that prevents user prompts and completions from being used to train or improve future model versions. Many AI coding tool vendors offer opt-out settings for privacy-conscious users.
Transformer Architecture
The neural network design underpinning nearly all modern LLMs, using stacked self-attention layers and feed-forward networks. Almost every AI coding assistant—from GitHub Copilot to Claude—is built on a transformer.