Web Scraping Glossary
20 essential terms every scraper, marketer, and data engineer should know.
Web Scraping
Web scraping is the automated extraction of data from websites, usually converting page content into a structured format.
Actor (Apify)
An Apify Actor is a cloud-hosted, serverless scraping or automation program.
Headless Browser
A headless browser is a web browser that runs without a graphical user interface.
Proxy Rotation
Proxy rotation is the practice of cycling through multiple IP addresses when making scraping requests, so the target server sees different IPs instead of repeated requests from the same source.
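As a minimal sketch of the idea, the snippet below cycles through a proxy list in round-robin order. The addresses are hypothetical placeholders from a documentation IP range, and `proxy_rotator` is an illustrative helper, not part of any library. Production setups often also drop failing proxies or pick randomly.

```python
from itertools import cycle

# Hypothetical proxy addresses (documentation-range IPs, not real proxies).
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def proxy_rotator(proxies):
    """Return a function that hands out the next proxy on each call, round-robin."""
    pool = cycle(proxies)
    return lambda: next(pool)


next_proxy = proxy_rotator(PROXIES)

# Each outgoing request would be routed through next_proxy().
# Here we only record which proxy four consecutive requests would use.
assigned = [next_proxy() for _ in range(4)]
```

The fourth request wraps around to the first proxy, so no single IP carries consecutive requests.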
Rate Limiting
Rate limiting is a server-side defense that restricts the number of requests a client can make within a given time window.
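One common way a server might enforce this is a sliding-window counter. The sketch below is an assumption about how such a limiter could look, not how any particular site implements it; the class name and the limits (3 requests per 10 seconds) are illustrative.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window_seconds` for one client."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.timestamps = deque()  # times of accepted requests, oldest first

    def allow(self, now):
        # Discard accepted requests that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) < self.max_requests:
            self.timestamps.append(now)
            return True
        # A real server would typically answer HTTP 429 (Too Many Requests) here.
        return False


limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10)

# Requests arriving at t = 0, 1, 2, 3 and 11 seconds.
results = [limiter.allow(t) for t in (0, 1, 2, 3, 11)]
```

The fourth request is rejected because three requests already fall inside the window; by t = 11 the two oldest have expired, so the request is accepted again.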
CAPTCHA
A CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is a challenge-response test used to determine whether a user is human.
Anti-Bot Protection
Anti-bot protection refers to the suite of server-side and client-side defenses websites use to detect and block automated traffic.
Compute Units (Apify)
Compute Units (CUs) are Apify's billing metric for platform usage.
Structured Data
Structured data is information organized in a predefined format that machines can easily read and process.
JSONL (JSON Lines)
JSONL (JSON Lines) is a text format where each line is a valid, self-contained JSON object.
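A short sketch of writing and reading JSONL with Python's standard `json` module follows. The records and field names are made up for illustration.

```python
import json

# Illustrative scraped records (hypothetical data).
records = [
    {"url": "https://example.com/a", "title": "Page A"},
    {"url": "https://example.com/b", "title": "Page B"},
]

# Serialize: one JSON object per line, each line terminated by a newline.
jsonl_text = "\n".join(json.dumps(r) for r in records) + "\n"

# Parse: every non-empty line is decoded on its own, so a file can be
# processed line by line without loading the whole dataset at once.
parsed = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
```

Because each line stands alone, new records can be appended to a file without rewriting what is already there.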
XPath
XPath is a query language for selecting nodes from an XML or HTML document tree.
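To illustrate, the sketch below runs an XPath-style query with Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath and requires well-formed markup. Real-world HTML usually needs a more tolerant library. The markup and class names are invented for the example.

```python
import xml.etree.ElementTree as ET

# A small, well-formed document standing in for a product listing page.
doc = ET.fromstring(
    "<html><body>"
    "<div class='product'><span class='price'>19.99</span></div>"
    "<div class='product'><span class='price'>5.00</span></div>"
    "<div class='ad'><span class='price'>0.00</span></div>"
    "</body></html>"
)

# Select price spans that are direct children of product divs, at any depth.
query = ".//div[@class='product']/span[@class='price']"
prices = [el.text for el in doc.findall(query)]
```

The predicate `[@class='product']` filters out the `ad` block, so only the two product prices are returned.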
CSS Selector
A CSS selector is a pattern used to select HTML elements by their tag, class, ID, attribute, or structural position.
HTML Parser
An HTML parser reads raw HTML text and converts it into a structured tree of nodes (the Document Object Model, or DOM) that programs can traverse and query.
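The sketch below shows the idea in miniature: it uses the event callbacks of Python's standard `html.parser` to build a simple node tree and then queries it. `Node`, `TreeBuilder`, and `find_all` are illustrative names, and the builder assumes well-nested input with no void elements such as `<br>`. Production parsers also recover from malformed markup.

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Build a simple node tree from well-nested HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]  # path from the root to the open element

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the innermost open element (never pop the root).
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1].text += data


def find_all(node, tag):
    """Depth-first search for every descendant (or self) with the given tag."""
    found = [node] if node.tag == tag else []
    for child in node.children:
        found.extend(find_all(child, tag))
    return found


builder = TreeBuilder()
builder.feed('<ul><li><a href="/docs">Docs</a></li><li><a href="/blog">Blog</a></li></ul>')
links = [(a.attrs["href"], a.text) for a in find_all(builder.root, "a")]
```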
RAG (Retrieval-Augmented Generation)
RAG is an AI architecture where a language model's responses are grounded by retrieving relevant documents from a knowledge base at inference time.
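The retrieval step can be sketched with a toy example. The code below ranks documents by word overlap with the query and places the best match into a prompt. This is a deliberate simplification: real systems typically use embedding similarity over a vector index, and the call to the language model is omitted. All function names and the knowledge base contents are illustrative.

```python
def tokenize(text):
    """Lowercase and split on whitespace (toy tokenizer)."""
    return set(text.lower().split())


def retrieve(query, documents, k=1):
    """Return the k documents sharing the most words with the query."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, documents):
    """Ground the question in retrieved context before it reaches the model."""
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Illustrative knowledge base, e.g. text gathered by a scraper.
KNOWLEDGE_BASE = [
    "Proxy rotation cycles through multiple IP addresses.",
    "JSONL stores one JSON object per line.",
    "An XML sitemap lists important URLs on a website.",
]

prompt = build_prompt("what does proxy rotation do", KNOWLEDGE_BASE)
```

Only the proxy rotation document shares words with the query, so it alone is placed in the prompt as context.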
AI Agent
An AI agent is an autonomous AI system that can perceive its environment, make decisions, and take actions to accomplish goals, including browsing the web, running tools, and calling APIs.
MCP (Model Context Protocol)
MCP (Model Context Protocol) is an open standard developed by Anthropic that enables AI models to connect to external tools and data sources through a standardized interface.
Residential Proxy
A residential proxy routes internet traffic through real household IP addresses (assigned by ISPs to home users), making requests appear as ordinary human browsing.
JavaScript Rendering
JavaScript rendering refers to executing a webpage's JavaScript code to produce the final visible HTML.
XML Sitemap
An XML sitemap is a file that lists all the important URLs on a website, helping search engines discover and index pages.
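For scrapers, a sitemap is a convenient seed list of URLs. The sketch below extracts the `<loc>` entries from a small inline sitemap using the standard library. The URLs are placeholders, and a real scraper would first download the file, commonly found at `/sitemap.xml`.

```python
import xml.etree.ElementTree as ET

# A minimal sitemap using the standard sitemaps.org namespace.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2024-01-15</lastmod></url>
  <url><loc>https://example.com/pricing</loc></url>
</urlset>"""

# Elements live in the sitemap namespace, so queries need a prefix mapping.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(SITEMAP)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", NS)]
```

Forgetting the namespace mapping is a frequent cause of empty results, since a plain `url/loc` query would match nothing here.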
Playwright
Playwright is an open-source browser automation library by Microsoft that supports Chromium, Firefox, and WebKit.