How Web Scraping Works: A Deep Dive Using Node.js and Cheerio
May 19, 2025

Learn how to scrape structured data from websites using Node.js and Cheerio, with real examples from an Internshala Automation project.

🕸️ What is Web Scraping?

Web scraping is the process of automatically extracting data from websites. It's useful when the data you need is not available via an API. With Cheerio in Node.js, you can parse HTML and navigate it with jQuery-style selectors.

In this post, we'll explore:

  • How web scraping works
  • Traversing the DOM using Cheerio
  • Handling anti-scraping techniques
  • Ethical considerations
  • Real code from an Internshala Automation project

⚙️ Tools We’ll Use

  • Node.js – Server-side JavaScript
  • Axios – For making HTTP requests
  • Cheerio – For parsing and traversing HTML
  • Internshala – The target site (educational purposes only)

📄 Installing Dependencies

npm install axios cheerio

🔍 Scraping Internshala – The Basics

Here’s a sample scraping function to fetch internship listings from Internshala.

const axios = require("axios");
const cheerio = require("cheerio");

// Fetch the search results page for the given keywords and extract each listing.
const scrapeInternships = async (keywords = "web development") => {
  // Replace every run of whitespace with a hyphen. A plain-string pattern
  // such as replace(" ", "-") would only change the first space.
  const slug = keywords.trim().replace(/\s+/g, "-");
  const url = `https://internshala.com/internships/keywords-${slug}`;

  // Download the raw HTML and load it into Cheerio.
  const { data } = await axios.get(url);
  const $ = cheerio.load(data);

  const internships = [];

  // Each .internship_meta block holds one listing.
  $(".internship_meta").each((i, el) => {
    const title = $(el).find(".profile").text().trim();
    const company = $(el).find(".company_name").text().trim();
    const location = $(el).find(".location_link").text().trim();
    const startDate = $(el).find(".start_immediately_desktop").text().trim();

    internships.push({ title, company, location, startDate });
  });

  return internships;
};

scrapeInternships().then(console.log).catch(console.error);
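
One detail in the URL construction is worth isolating: String.prototype.replace with a plain-string pattern changes only the first match, so multi-word keywords need a global pattern. The helper below is a minimal sketch; the name toKeywordSlug is hypothetical and not part of the original project.

```javascript
// Hypothetical helper (not from the original project): turns free-text
// keywords into the hyphenated form used in the search URL.
const toKeywordSlug = (keywords) =>
  keywords
    .trim()
    .replace(/\s+/g, "-"); // global regex: replaces every run of whitespace

// With a string pattern, only the first space is replaced.
const firstOnly = "full stack web development".replace(" ", "-");
const allSpaces = toKeywordSlug("full stack web development");

console.log(firstOnly); // full-stack web development
console.log(allSpaces); // full-stack-web-development
```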

🧭 How DOM Traversal Works

Using Cheerio, you can traverse HTML using jQuery-like selectors.

Example:

const title = $(el).find(".profile").text().trim();

This finds the element with the .profile class inside the current internship block, reads its text content, and trims the surrounding whitespace to get the title.


🛡️ Anti-Scraping Protections

Websites often try to block scrapers. Here’s how to handle common techniques:

1. User-Agent Blocking

Some sites block requests that don’t have a real browser User-Agent.

const { data } = await axios.get(url, {
  headers: {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
  },
});

2. Rate Limiting

Avoid sending too many requests too quickly. Use setTimeout or libraries like p-queue.
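
As a rough sketch of the setTimeout approach, the snippet below fetches a list of URLs one at a time with a fixed pause between requests. The names sleep and fetchSequentially, the fetchPage parameter, and the one-second default delay are illustrative assumptions, not part of the original project.

```javascript
// Promise wrapper around setTimeout so the delay can be awaited.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: calls fetchPage for each URL in order, waiting
// delayMs between consecutive requests.
const fetchSequentially = async (urls, fetchPage, delayMs = 1000) => {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) await sleep(delayMs); // no pause before the first request
    results.push(await fetchPage(urls[i]));
  }
  return results;
};
```

In the scraper, fetchPage could be something like (url) => axios.get(url). Passing the fetcher in as a parameter also makes the helper easy to try out with a fake function before pointing it at a real site.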

3. JavaScript-Rendered Content

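If the site loads its data via JavaScript after the initial page load, Cheerio alone won’t see it, because Cheerio only parses the HTML it is given and does not execute scripts. You’ll need a headless browser such as Puppeteer or Playwright for such cases.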
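
One quick way to tell whether a page falls into this category is to check whether the raw HTML from the server already contains the markup you intend to select. The following is a heuristic sketch only: looksServerRendered is a hypothetical name, the two sample pages are made up, and a pattern match on class attributes can still be fooled (for example, by markup that sits inside an inline script).

```javascript
// Hypothetical heuristic: returns true when the raw HTML already contains
// the given class name inside a class attribute, which suggests the content
// is present before any JavaScript runs.
const looksServerRendered = (html, className) => {
  const classAttr = /class\s*=\s*["']([^"']*)["']/gi;
  for (const match of html.matchAll(classAttr)) {
    if (match[1].split(/\s+/).includes(className)) return true;
  }
  return false;
};

// Made-up sample pages for illustration.
const staticPage =
  '<div class="internship_meta"><h3 class="profile">Web Dev</h3></div>';
const shellPage = '<div id="root"></div><script src="/app.js"></script>';

console.log(looksServerRendered(staticPage, "internship_meta")); // true
console.log(looksServerRendered(shellPage, "internship_meta")); // false
```

If the check fails for a selector you can see in the browser's inspector, the content is probably injected client-side, and a headless browser is the safer choice.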


✅ Ethical Web Scraping

While scraping is powerful, it comes with responsibilities:

  • Respect robots.txt – Check if scraping is allowed
  • Don’t overload servers – Use rate limits
  • Use scraped data only for personal or educational use
  • Avoid login-based or paid content without permission
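
To make the robots.txt point concrete, here is a deliberately simplified sketch of checking a path against Disallow rules. The function name isPathDisallowed is hypothetical, the sample file is made up (it is not Internshala's real robots.txt), and the check ignores Allow rules, wildcards, and agent-specific groups, so treat it as an illustration rather than a compliant parser.

```javascript
// Simplified robots.txt check: collects Disallow rules from groups that
// apply to all user agents ("*") and tests a path by prefix match.
// Assumption: Allow rules, wildcards, and agent-specific groups are ignored.
const isPathDisallowed = (robotsTxt, path) => {
  const disallowed = [];
  let appliesToAll = false;
  let lastWasAgent = false;

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split("#")[0].trim(); // strip comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;

    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      // Consecutive User-agent lines belong to the same group.
      if (!lastWasAgent) appliesToAll = false;
      if (value === "*") appliesToAll = true;
      lastWasAgent = true;
    } else {
      if (field === "disallow" && appliesToAll && value !== "") {
        disallowed.push(value);
      }
      lastWasAgent = false;
    }
  }

  return disallowed.some((rule) => path.startsWith(rule));
};

// Made-up sample file for illustration.
const sample = [
  "User-agent: *",
  "Disallow: /admin/",
  "",
  "User-agent: ExampleBot",
  "Disallow: /",
].join("\n");

console.log(isPathDisallowed(sample, "/admin/users")); // true
console.log(isPathDisallowed(sample, "/internships/")); // false
```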

🧪 Real Use Case – Automating Internshala

In our Internshala Automation Project, we scrape internship data and filter based on user preferences like location, skills, and start date.

Here’s a simplified snippet:

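// Keep only the listings whose location contains the given text (case-insensitive).
const filterByLocation = (data, location) =>
  data.filter((job) =>
    job.location.toLowerCase().includes(location.toLowerCase())
  );

// await is not available at the top level of a CommonJS script,
// so the call is wrapped in an async function.
(async () => {
  const filtered = filterByLocation(await scrapeInternships(), "remote");
  console.log(filtered);
})().catch(console.error);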
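
The same idea extends to several preferences at once. The sketch below is an assumption about how that could look, not the project's actual code: filterByPreferences and includesIgnoreCase are hypothetical names, the sample records are made up, and because skills are not among the scraped fields, a title keyword stands in for them here.

```javascript
// Case-insensitive substring test shared by all preference checks.
const includesIgnoreCase = (text, fragment) =>
  text.toLowerCase().includes(fragment.toLowerCase());

// Hypothetical extension of filterByLocation: every preference is optional,
// and a listing must satisfy all preferences that are provided.
const filterByPreferences = (data, { location, titleKeyword, startDate } = {}) =>
  data.filter(
    (job) =>
      (!location || includesIgnoreCase(job.location, location)) &&
      (!titleKeyword || includesIgnoreCase(job.title, titleKeyword)) &&
      (!startDate || includesIgnoreCase(job.startDate, startDate))
  );

// Made-up sample records in the shape returned by scrapeInternships().
const sampleJobs = [
  { title: "Web Development", company: "Acme", location: "Remote", startDate: "Immediately" },
  { title: "Data Science", company: "Globex", location: "Delhi", startDate: "Immediately" },
  { title: "Web Design", company: "Initech", location: "Remote", startDate: "Immediately" },
];

// Matches only the first record (Acme).
console.log(
  filterByPreferences(sampleJobs, { location: "remote", titleKeyword: "development" })
);
```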

📦 Wrapping Up

Web scraping using Node.js and Cheerio is a practical skill that opens up automation and data collection possibilities. Whether you're tracking job listings or extracting prices from e-commerce sites, the fundamentals remain the same.

✅ Keep it ethical
✅ Avoid scraping sensitive or protected data
✅ Respect site policies


Happy Scraping! 🚀