Web Scraping for AI: How to Build Training Datasets

Learn how to collect web data for AI and machine learning. Build training datasets, create RAG knowledge bases, and power AI agents with scraped data.

As an Apify affiliate, we may earn a commission from qualifying purchases made through our links, at no extra cost to you. We only recommend tools we believe in.

The AI revolution runs on data. Whether you’re fine-tuning LLMs, building RAG systems, or training custom models, the quality of your data determines the quality of your AI. In 2025, 65% of enterprises use web scraping for AI/ML projects, and this market segment is growing at 14.2% CAGR. Web scraping is the most scalable way to collect the training data you need.

This guide covers practical techniques for collecting, processing, and preparing web data for AI applications. It is updated with 2025 best practices for RAG and embeddings, and with the current LLM landscape (GPT-4o, Claude 3.5, Llama 3) in mind.

Why Web Data for AI?

The Data Advantage

  • The GPT models behind ChatGPT were trained on hundreds of billions of tokens of text, much of it drawn from the web
  • Google’s AI leverages decades of indexed web content
  • Open-source models improve with diverse web datasets

Your AI is only as good as its training data. Web scraping gives you access to:

  • Domain-specific knowledge not in general models
  • Current information beyond training cutoffs
  • Specialized formats (reviews, Q&A, documentation)
  • Multilingual content for global applications

AI Use Cases for Web Data

| Use Case | Data Needed | Source Examples |
|---|---|---|
| RAG knowledge bases | Domain documents | Blogs, docs, wikis |
| Chatbot training | Q&A pairs | Forums, FAQs, support |
| Sentiment analysis | Reviews, opinions | Amazon, Yelp, social |
| Content generation | Writing examples | News, blogs, articles |
| Named entity recognition | Labeled text | Directories, databases |
| Market intelligence | Business data | Listings, profiles |

Types of AI Training Data

1. Unstructured Text

Examples:

  • Blog posts and articles
  • Documentation and wikis
  • News articles
  • Social media posts

Best for:

  • LLM pre-training
  • RAG document stores
  • Content generation
  • Summarization

2. Structured Data

Examples:

  • Product catalogs
  • Business listings
  • Event databases
  • Person profiles

Best for:

  • Knowledge graphs
  • Entity extraction
  • Data augmentation
  • Factual grounding

3. Question-Answer Pairs

Examples:

  • FAQ pages
  • Stack Overflow
  • Quora answers
  • Reddit threads

Best for:

  • Chatbot training
  • Fine-tuning for Q&A
  • Instruction tuning
  • RLHF datasets

4. Labeled Data

Examples:

  • Review ratings (sentiment)
  • Category tags (classification)
  • Product attributes (extraction)

Best for:

  • Supervised learning
  • Classification models
  • Sentiment analysis
  • Custom extractors

Building a RAG Knowledge Base

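Retrieval-Augmented Generation (RAG) is usually the fastest way to add custom knowledge to an LLM, because it retrieves relevant documents at query time instead of changing the model's weights through fine-tuning.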

Step 1: Identify Knowledge Sources

For a customer support chatbot:

Primary sources:
- Help documentation
- FAQ pages
- Product manuals
- Knowledge base articles

Secondary sources:
- Blog posts (how-to articles)
- Community forums
- Support ticket resolutions
- Release notes

Step 2: Extract Content

Use web scrapers to collect:

Data fields:
- Page URL (for citations)
- Title
- Full text content
- Last updated date
- Category/tags
- Related links

Scraper configuration:

Start URLs: https://docs.example.com
Crawl depth: 3
Content selector: main, article, .content
Exclude: nav, footer, sidebar
Output: JSON with metadata

Step 3: Clean and Chunk

Transform raw HTML into clean text:

Cleaning steps:

  1. Remove HTML tags
  2. Strip navigation/boilerplate
  3. Normalize whitespace
  4. Fix encoding issues
  5. Remove duplicates
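
The cleaning steps above can be sketched with nothing but the Python standard library. This is a minimal illustration, not a production cleaner: the `SKIP_TAGS` set, `clean_html`, and `deduplicate` are hypothetical names chosen for this example, and repairing broken encodings (step 4) is left out because it depends heavily on the source site.

```python
import hashlib
import re
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate (an assumption; adjust per site).
SKIP_TAGS = {"nav", "footer", "aside", "script", "style"}


class TextExtractor(HTMLParser):
    """Collects visible text while skipping boilerplate containers."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def clean_html(html: str) -> str:
    """Strip tags and boilerplate, then collapse runs of whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def deduplicate(texts):
    """Drop exact duplicates, keeping the first occurrence of each text."""
    seen, unique = set(), []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Hashing only catches exact duplicates. Near-duplicates (the same article with a different footer, for example) need a fuzzier comparison, which is outside the scope of this sketch.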

Chunking strategy (2025 Best Practices):

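Chunk-size recommendations have shifted toward smaller chunks, which tend to retrieve more precisely than large ones. Treat the values below as a starting point and tune them against your own retrieval results: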

Recommended chunk size: 128-512 tokens (256-300 words optimal)
Overlap: 10-20% of chunk size
Split on: Semantic boundaries (paragraphs, headers)
Preserve: Document metadata + contextual headers

Note: The default in OpenAI's file search tool (800-token chunks with a 400-token overlap) sits well above this range, so treat it as a baseline to tune rather than an optimum. Semantic chunking with contextual headers (adding parent section titles to each chunk) can noticeably improve retrieval accuracy.
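
A rough sketch of paragraph-aware chunking with overlap and a contextual header follows. It makes two simplifying assumptions: word counts stand in for tokens, and `max_words` is a soft limit (a single paragraph longer than the budget is kept whole). The function name `chunk_text` and its parameters are illustrative, not part of any library.

```python
def chunk_text(text, header="", max_words=250, overlap_ratio=0.15):
    """Pack paragraphs into overlapping chunks, each prefixed with a header.

    Paragraphs (separated by blank lines) are never split, so chunk
    boundaries fall on semantic boundaries. Word counts approximate tokens.
    """
    paragraphs = [p.split() for p in text.split("\n\n") if p.strip()]
    overlap = int(max_words * overlap_ratio)
    chunks, current = [], []
    for words in paragraphs:
        # Close the current chunk when this paragraph would exceed the budget.
        if current and len(current) + len(words) > max_words:
            chunks.append(current)
            # Carry the tail of the previous chunk forward as overlap.
            current = current[-overlap:] if overlap else []
        current = current + words
    if current:
        chunks.append(current)
    prefix = f"{header}\n" if header else ""
    return [prefix + " ".join(words) for words in chunks]
```

Passing something like `header="Docs > Billing > Refunds"` gives every chunk the context of its parent sections, which is the "contextual headers" idea mentioned above.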

Step 4: Generate Embeddings

Convert chunks to vectors:

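from openai import OpenAI  # openai>=1.0 client interface

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    input=chunks,  # list of text chunks from Step 3
    model="text-embedding-3-small",
)
embeddings = [item.embedding for item in response.data]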

Step 5: Store in Vector Database

Popular options in 2025:

  • Pinecone - Managed, scalable, serverless options
  • Weaviate - Open-source, hybrid search, multimodal
  • ChromaDB - Simple, local, great for prototyping
  • Qdrant - High performance, Rust-based
  • Supabase pgvector - PostgreSQL-native, increasingly mainstream
  • Milvus - Enterprise-scale, NVIDIA partnership
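
To make the role of the vector database concrete, here is a toy in-memory store that ranks stored vectors by cosine similarity. It is only an illustration of the core operation. The class `InMemoryVectorStore` is hypothetical, and the products listed above add persistence, approximate indexes, and filtering on top of this idea.

```python
import math


class InMemoryVectorStore:
    """Brute-force cosine-similarity search over (vector, metadata) pairs."""

    def __init__(self):
        self.items = []

    def add(self, vector, metadata):
        self.items.append((list(vector), metadata))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def search(self, query, top_k=3):
        """Return the top_k (score, metadata) pairs, best match first."""
        scored = [(self._cosine(query, vec), meta) for vec, meta in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]
```

Storing the page URL and title in `metadata` lets the LLM cite its sources when a retrieved chunk is used in an answer.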

Collecting Data for LLM Fine-Tuning

Instruction-Following Data

Format for chat models:

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant..."},
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "To reset your password..."}
  ]
}

Sources to scrape:

  • Customer support transcripts
  • FAQ pages (convert to Q&A)
  • Forum discussions
  • Product documentation
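
As a sketch of the "convert to Q&A" step, the snippet below turns scraped question/answer pairs into the chat format shown above and serializes them as JSONL (one record per line). The function names and the default system prompt are assumptions made for this example.

```python
import json


def faq_to_chat_records(faq_pairs, system_prompt="You are a helpful assistant."):
    """Convert (question, answer) tuples into chat-format training records."""
    records = []
    for question, answer in faq_pairs:
        question, answer = question.strip(), answer.strip()
        if not question or not answer:
            continue  # skip incomplete pairs
        records.append({
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]
        })
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```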

Domain Adaptation Data

For specialized domains, collect:

Medical:

  • PubMed abstracts
  • Medical encyclopedias
  • Drug databases
  • Clinical guidelines

Legal:

  • Case law databases
  • Legal encyclopedias
  • Contract examples
  • Regulatory documents

Technical:

  • GitHub repositories
  • Stack Overflow
  • Documentation sites
  • Technical blogs

Quality Filters

Apply filters to ensure data quality:

quality_checks = {
    "min_length": 100,  # tokens
    "max_length": 4096,
    "language": "en",
    "no_profanity": True,
    "no_personal_info": True,
    "deduplicated": True
}
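
A few of these checks can be sketched in plain Python. The example below covers length, a crude personal-info screen, and deduplication. Language and profanity detection are left out because they usually rely on external libraries or word lists. The regular expressions are deliberately simple assumptions and will miss many real-world PII patterns.

```python
import re

# Deliberately simple patterns (assumptions); real PII detection needs more care.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def passes_quality(text, min_length=100, max_length=4096):
    """Check length bounds and screen for obvious personal info."""
    n_tokens = len(text.split())  # whitespace words approximate tokens
    if not (min_length <= n_tokens <= max_length):
        return False
    if EMAIL_RE.search(text) or PHONE_RE.search(text):
        return False
    return True


def filter_examples(texts, **limits):
    """Keep texts that pass the checks, dropping case/whitespace duplicates."""
    seen, kept = set(), []
    for text in texts:
        key = " ".join(text.lower().split())
        if key in seen or not passes_quality(text, **limits):
            continue
        seen.add(key)
        kept.append(text)
    return kept
```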

Data for AI Agents

AI agents need real-time access to current information. Web scraping provides:

1. Real-Time Information

Example: Travel agent

Scraped data:
- Flight prices (hourly updates)
- Hotel availability
- Weather forecasts
- Travel advisories
- Local events

2. Business Intelligence

Example: Sales agent

Scraped data:
- Company profiles
- Key personnel
- Recent news
- Technology stack
- Funding history

3. Market Data

Example: Trading agent

Scraped data:
- News headlines
- Social sentiment
- Competitor actions
- Regulatory filings
- Industry reports

Step-by-Step: Building a Training Dataset

Example: E-commerce Product Q&A Dataset

Goal: Train a model to answer product questions

Step 1: Scrape product pages

Source: Amazon product listings
Data: Title, description, specifications
Volume: 10,000 products

Step 2: Scrape Q&A sections

Source: Amazon Q&A
Data: Question, answer, votes
Volume: 50,000 Q&A pairs

Step 3: Scrape reviews

Source: Amazon reviews
Data: Review text, rating, helpfulness
Volume: 100,000 reviews

Step 4: Create training pairs

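# Assumes each scraped product exposes .description and .questions,
# and each question exposes .question and .answer.
training_data = []
for product in products:
    for qa in product.questions:
        training_data.append({
            "context": product.description,
            "question": qa.question,
            "answer": qa.answer
        })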

Step 5: Quality filtering

  • Remove short answers (<20 chars)
  • Filter by helpfulness votes
  • Deduplicate similar Q&As
  • Validate answer relevance

Step 6: Format for training

{
  "instruction": "Answer the question based on the product info.",
  "input": "Product: [description]\nQuestion: [question]",
  "output": "[answer]"
}
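
Steps 5 and 6 can be combined into one pass, sketched below under a few assumptions: each row has the `context`, `question`, and `answer` keys from Step 4, plus an optional `votes` count (a hypothetical field carried over from the Q&A scrape). Relevance validation is not covered here, since it typically needs a model or human review.

```python
def build_instruction_records(training_data, min_answer_chars=20, min_votes=0):
    """Filter Q&A rows and format the survivors as instruction records."""
    instruction = "Answer the question based on the product info."
    records, seen = [], set()
    for row in training_data:
        answer = row["answer"].strip()
        if len(answer) < min_answer_chars:
            continue  # remove short answers
        if row.get("votes", 0) < min_votes:
            continue  # filter by helpfulness votes
        key = (row["question"].strip().lower(), answer.lower())
        if key in seen:
            continue  # drop exact duplicate Q&As
        seen.add(key)
        records.append({
            "instruction": instruction,
            "input": f"Product: {row['context']}\nQuestion: {row['question']}",
            "output": answer,
        })
    return records
```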

Data Processing Pipeline

[Web Sources] 
    ↓ (Scrapers)
[Raw Data Storage]
    ↓ (Cleaning)
[Cleaned Data]
    ↓ (Processing)
[Training Format]
    ↓ (Quality Checks)
[Final Dataset]

Tools for Each Stage

| Stage | Tools |
|---|---|
| Extraction | Apify, Scrapy, Playwright |
| Storage | S3, GCS, Azure Blob |
| Cleaning | BeautifulSoup, regex, spaCy |
| Processing | Pandas, Spark, Dask |
| Quality | LangChain, custom validators |
| Training | Hugging Face, PyTorch |

Respecting Robots.txt

User-agent: *
Disallow: /private/
Allow: /public/
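
Python's standard library can evaluate rules like these before a scraper requests a page. The sketch below parses the example rules from a string so it runs offline. The helper `build_robot_rules` and the `MyScraper` user agent are names invented for this example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /public/
"""


def build_robot_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_robot_rules(ROBOTS_TXT)

# Check each URL before fetching it.
for url in ("https://example.com/public/page", "https://example.com/private/data"):
    verdict = "allowed" if rules.can_fetch("MyScraper", url) else "blocked"
    print(url, verdict)
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` downloads the file instead of parsing a string.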

Terms of Service

  • Read ToS before scraping
  • Avoid scraping login-required content
  • Don’t overload servers
  • Cache aggressively
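
The last two points, not overloading servers and caching aggressively, can be combined in a small wrapper around whatever function actually downloads a page. This is a sketch under stated assumptions: `PoliteFetcher` is a hypothetical class, the `fetch` callable is supplied by you, and the cache is a plain in-memory dictionary with no expiry.

```python
import time


class PoliteFetcher:
    """Wraps a fetch function with a minimum delay and an in-memory cache."""

    def __init__(self, fetch, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.fetch = fetch          # callable: url -> response body
        self.min_delay = min_delay  # seconds between real requests
        self.clock = clock
        self.sleep = sleep
        self.cache = {}
        self.last_request = None

    def get(self, url):
        # Serve repeated URLs from the cache without touching the server.
        if url in self.cache:
            return self.cache[url]
        # Wait until at least min_delay has passed since the last request.
        if self.last_request is not None:
            wait = self.min_delay - (self.clock() - self.last_request)
            if wait > 0:
                self.sleep(wait)
        body = self.fetch(url)
        self.last_request = self.clock()
        self.cache[url] = body
        return body
```

The `clock` and `sleep` parameters exist only so the delay logic can be tested without actually waiting. In normal use the defaults are fine.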

Data Rights

Generally OK:

  • Factual information
  • Public domain content
  • Licensed content (with attribution)
  • Aggregated statistics

Avoid:

  • Copyrighted content (for redistribution)
  • Personal data (without consent)
  • Paywalled content
  • Private communications

AI-Specific Guidelines

  • Transparency - Document your data sources
  • Bias mitigation - Diverse sources reduce bias
  • Privacy - Remove PII from training data
  • Attribution - Credit sources when possible

Best Practices for AI Data Collection

1. Diversity

Collect from multiple sources to reduce bias:

  • Different websites
  • Various authors
  • Multiple perspectives
  • Geographic diversity

2. Recency

For current knowledge:

  • Scrape regularly (weekly/daily)
  • Include timestamps
  • Prioritize recent content
  • Expire old data
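
Timestamps make expiry straightforward. The sketch below assumes every record carries a `scraped_at` field holding a timezone-aware ISO 8601 timestamp. Both the field name and the 90-day default are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone


def drop_expired(records, max_age_days=90, now=None):
    """Keep records newer than the cutoff, most recent first."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    dated = [(datetime.fromisoformat(r["scraped_at"]), r) for r in records]
    fresh = [(ts, r) for ts, r in dated if ts >= cutoff]
    fresh.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in fresh]
```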

3. Quality Over Quantity

Good dataset:
- 10,000 high-quality examples
- Curated and validated
- Diverse and representative

Better than:
- 1,000,000 noisy examples
- Uncleaned raw scrapes
- Duplicate content

4. Documentation

Track your dataset metadata:

{
  "name": "Product QA Dataset v1",
  "sources": ["amazon.com"],
  "collection_date": "2024-12",
  "size": 50000,
  "format": "JSONL",
  "fields": ["context", "question", "answer"],
  "license": "Internal use only",
  "quality_checks": ["length", "dedup", "relevance"]
}
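
Metadata like this is easier to keep accurate when it is generated from the data itself. The sketch below derives `size` and `fields` from the records and adds a SHA-256 checksum so a later reader can confirm which file the card describes. The checksum field and the function name `build_dataset_card` are additions for this example, not part of the format above.

```python
import hashlib
import json


def build_dataset_card(name, records, sources, license_text,
                       quality_checks, collection_date):
    """Build a metadata dict whose size, fields, and checksum come from the data."""
    jsonl = "\n".join(json.dumps(r, sort_keys=True) for r in records)
    return {
        "name": name,
        "sources": sorted(sources),
        "collection_date": collection_date,
        "size": len(records),
        "format": "JSONL",
        "fields": sorted({key for r in records for key in r}),
        "license": license_text,
        "quality_checks": list(quality_checks),
        "sha256": hashlib.sha256(jsonl.encode("utf-8")).hexdigest(),
    }
```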

Getting Started

Quick Start for RAG

  1. Identify your knowledge sources
  2. Use Website Content Crawler to extract content
  3. Clean and chunk the text
  4. Generate embeddings
  5. Store in vector database
  6. Query with your LLM

Quick Start for Fine-Tuning

  1. Define your task (Q&A, classification, etc.)
  2. Find sources with relevant examples
  3. Extract and structure the data
  4. Format for your training framework
  5. Validate data quality
  6. Start with small experiments

Need help building AI training datasets? Contact us for custom data collection solutions.

Tags

#ai #machine learning #training data #llm #rag #dataset
ParseFlow

Automation Expert & Technical Founder

Specializing in web scraping, browser automation, and data harvesting solutions. Helping businesses scale with automated insights.