Web Scraping Glossary

20 essential terms every scraper, marketer, and data engineer should know.

Web Scraping

Web scraping is the automated extraction of data from websites, usually converting page content into a structured format.

Actor (Apify)

An Apify Actor is a cloud-hosted, serverless scraping or automation program.

Headless Browser

A headless browser is a web browser that runs without a graphical user interface.

Proxy Rotation

Proxy rotation is the practice of cycling through multiple IP addresses when making scraping requests, so the target server sees different IPs instead of repeated requests from the same source.
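
The cycling described above can be sketched as a simple round-robin over a proxy pool. This is a minimal illustration only: the proxy URLs and the `make_rotator` helper are hypothetical, and production rotators usually also retire failing proxies and randomize the order.

```python
from itertools import cycle

# Hypothetical proxy endpoints, for illustration only.
PROXIES = [
    "http://proxy-a.example:8000",
    "http://proxy-b.example:8000",
    "http://proxy-c.example:8000",
]

def make_rotator(proxies):
    """Return a function that hands out proxies in round-robin order."""
    pool = cycle(proxies)
    return lambda: next(pool)

next_proxy = make_rotator(PROXIES)

# Assign a proxy to each of four outgoing requests.
# The fourth request wraps around to the first proxy again.
assigned = [next_proxy() for _ in range(4)]
```

Each outgoing request would then be sent through the proxy returned by `next_proxy()`, so consecutive requests reach the target from different addresses.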

Rate Limiting

Rate limiting is a server-side defense that restricts the number of requests a client can make within a given time window.
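
One common way a server might enforce such a limit is a fixed-window counter. The sketch below is an assumption about how a limiter could work, not a description of any particular site; the `FixedWindowLimiter` class and its parameters are invented for illustration.

```python
import time

class FixedWindowLimiter:
    """Allow at most `limit` requests per client in each `window`-second window."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.counters = {}  # client id -> (window index, request count)

    def allow(self, client_id):
        window_index = int(self.clock() // self.window)
        index, count = self.counters.get(client_id, (window_index, 0))
        if index != window_index:
            count = 0  # a new window has started, so reset the counter
        if count >= self.limit:
            return False  # a real server would typically answer HTTP 429 here
        self.counters[client_id] = (window_index, count + 1)
        return True
```

A scraper that receives a rejection like this generally needs to slow down or wait for the next window before retrying.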

CAPTCHA

A CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is a challenge-response test used to determine whether a user is human.

Anti-Bot Protection

Anti-bot protection refers to the suite of server-side and client-side defenses websites use to detect and block automated traffic.

Compute Units (Apify)

Compute Units (CUs) are Apify's billing metric for platform usage.
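
As a rough worked example, the sketch below assumes one CU corresponds to 1 GB of memory used for one hour, which is how Apify's documentation has described the unit. Treat the formula as an assumption and confirm it against current pricing; the `compute_units` helper is hypothetical.

```python
def compute_units(memory_mb, runtime_seconds):
    """Estimate CUs as memory in GB multiplied by runtime in hours.

    Assumes 1 CU = 1 GB of memory for 1 hour (verify against Apify's docs).
    """
    return (memory_mb / 1024) * (runtime_seconds / 3600)

# A run with 4096 MB of memory that lasts 15 minutes.
estimate = compute_units(4096, 15 * 60)
```

Under that assumption, quadrupling memory while quartering runtime leaves the estimate unchanged.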

Structured Data

Structured data is information organized in a predefined format that machines can easily read and process.

JSONL (JSON Lines)

JSONL (JSON Lines) is a text format where each line is a valid, self-contained JSON object.
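
A short sketch of writing and reading the format with Python's standard library follows. The records are made-up sample data, and an in-memory buffer stands in for a file.

```python
import io
import json

# Made-up sample records.
records = [
    {"url": "https://example.com/a", "title": "Page A"},
    {"url": "https://example.com/b", "title": "Page B"},
]

# Write: one JSON object per line.
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + "\n")

# Read: parse each non-empty line independently of the others.
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]
```

Because every line parses on its own, a scraper can append results as they arrive and a reader can process the file as a stream without loading it all at once.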

XPath

XPath is a query language for selecting nodes from an XML or HTML document tree.
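
The sketch below runs an XPath expression with Python's built-in `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup. The sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup.
doc = ET.fromstring("""
<html>
  <body>
    <div class="product"><span class="price">19.99</span></div>
    <div class="product"><span class="price">5.00</span></div>
    <div class="ad"><span class="price">0.00</span></div>
  </body>
</html>
""")

# Select price spans that sit directly inside product divs, at any depth.
expression = ".//div[@class='product']/span[@class='price']"
prices = [span.text for span in doc.findall(expression)]
```

The attribute predicates keep the price inside the `ad` block out of the result.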

CSS Selector

A CSS selector is a pattern used to select HTML elements by their tag, class, ID, attribute, or structural position.

HTML Parser

An HTML parser reads raw HTML text and converts it into a structured tree of nodes (the Document Object Model, or DOM) that programs can traverse and query.
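
To illustrate the idea, the sketch below builds a small node tree on top of Python's built-in `html.parser`, which itself only emits start, end, and text events. The `Node`, `TreeBuilder`, and `find_all` names are hypothetical, and the code assumes well-nested tags with no unclosed void elements such as `<br>`; real parsers handle far messier input.

```python
from html.parser import HTMLParser

class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # child Node objects and text strings

class TreeBuilder(HTMLParser):
    """Build a simple node tree from parser events (assumes well-nested tags)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self.current)
        self.current.children.append(node)
        self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent  # climb back up one level

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data)

def find_all(node, tag):
    """Depth-first search for descendant nodes with the given tag."""
    found = []
    for child in node.children:
        if isinstance(child, Node):
            if child.tag == tag:
                found.append(child)
            found.extend(find_all(child, tag))
    return found

builder = TreeBuilder()
builder.feed('<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>')
hrefs = [a.attrs["href"] for a in find_all(builder.root, "a")]
```

Once the tree exists, extracting data becomes a matter of walking nodes rather than searching raw text.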

RAG (Retrieval-Augmented Generation)

RAG is an AI architecture where a language model's responses are grounded by retrieving relevant documents from a knowledge base at inference time.
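
The retrieve-then-generate flow can be sketched with a toy example. Here retrieval is plain word overlap and the model call is left out entirely; real systems typically use embeddings and a vector index. The documents and function names below are made up for illustration.

```python
import re

# Tiny in-memory knowledge base; contents are made up for illustration.
DOCUMENTS = [
    "Proxy rotation cycles through multiple IP addresses.",
    "JSONL stores one JSON object per line.",
    "XPath selects nodes from an XML or HTML document tree.",
]

def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, k=1):
    """Rank documents by the number of words they share with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, documents):
    """Place the retrieved text ahead of the question for the model to use."""
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("What does JSONL store on each line?", DOCUMENTS)
```

The resulting prompt would then be passed to a language model, which answers from the retrieved context rather than from its training data alone.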

AI Agent

An AI agent is an autonomous AI system that can perceive its environment, make decisions, and take actions to accomplish goals, including browsing the web, running tools, and calling APIs.

MCP (Model Context Protocol)

MCP (Model Context Protocol) is an open standard developed by Anthropic that enables AI models to connect to external tools and data sources through a standardized interface.

Residential Proxy

A residential proxy routes internet traffic through real household IP addresses (assigned by ISPs to home users), making requests appear as ordinary human browsing.

JavaScript Rendering

JavaScript rendering refers to executing a webpage's JavaScript code to produce the final page content, including elements that are absent from the initial HTML response.

XML Sitemap

An XML sitemap is a file that lists all the important URLs on a website, helping search engines discover and index pages.
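
For scrapers, a sitemap is a convenient source of URLs to crawl. The sketch below extracts them with Python's standard library; the sitemap content is invented sample data, and a real one would be downloaded from the site first.

```python
import xml.etree.ElementTree as ET

# Invented sample sitemap using the standard sitemap namespace.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2024-01-01</lastmod></url>
  <url><loc>https://example.com/pricing</loc></url>
</urlset>"""

# Sitemap elements live in a namespace, so queries must reference it.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(SITEMAP.encode("utf-8"))
urls = [loc.text for loc in root.findall("sm:url/sm:loc", NS)]
```

Forgetting the namespace is a common mistake here: a plain `url/loc` query would match nothing.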

Playwright

Playwright is an open-source browser automation library by Microsoft that supports Chromium, Firefox, and WebKit.