Web Scraping with Cheerio

2023-04-03
By: O. Wolfson

In this tutorial, we'll use Axios and Cheerio to scrape quotes by Rodney Dangerfield from the AZQuotes website and save them to a JSON file.

Prerequisites:

  1. Basic knowledge of JavaScript and Node.js
  2. Node.js installed on your computer

Step 1: Install required packages

First, create a new folder for your project and navigate to it using the command line. Then, run the following command to initialize a new Node.js project:

bash
npm init -y

Next, install Axios and Cheerio using the following command:

bash
npm install axios cheerio

Step 2: Create a web scraping script

Create a new file called scrape_quotes.js and paste the code below into the file.

javascript
// Import required libraries
const axios = require("axios"); // Used for making HTTP requests
const cheerio = require("cheerio"); // Used for parsing HTML and manipulating DOM elements
const fs = require("fs"); // Used for working with the file system

// Set the base URL for the quotes website
const baseURL = "https://www.azquotes.com/author/3621-Rodney_Dangerfield";

// Define an async function to fetch quotes from a single page
const fetchQuotes = async (url) => {
  try {
    // Make an HTTP GET request to the specified URL
    const response = await axios.get(url);
    // Get the HTML content from the response
    const html = response.data;
    // Load the HTML content into Cheerio
    const $ = cheerio.load(html);
    // Select all 'a.title' elements from the HTML
    const quoteElements = $("a.title");
    // Initialize an empty array to store quotes
    const quotes = [];

    // Iterate through each quote element found
    quoteElements.each((index, element) => {
      // Extract the quote text and trim any whitespace
      const quoteText = $(element).text().trim();
      // Extract the 'data-author' attribute (it may be missing) and trim any whitespace
      const author = ($(element).attr("data-author") || "").trim();

      // Check if the author is Rodney Dangerfield
      if (author === "Rodney Dangerfield") {
        // Add the quote to the quotes array
        quotes.push({ author, quote: quoteText });
      }
    });

    // Return the quotes array
    return quotes;
  } catch (error) {
    // Log any errors and return an empty array
    console.error(`Error fetching the website: ${error}`);
    return [];
  }
};

// Define an async function to fetch quotes from multiple pages
const fetchAllQuotes = async (baseURL, totalPages) => {
  // Initialize an empty array to store all quotes
  const allQuotes = [];

  // Iterate through each page number from 1 to totalPages
  for (let pageNumber = 1; pageNumber <= totalPages; pageNumber++) {
    // Construct the URL for the current page
    const url = `${baseURL}?p=${pageNumber}`;
    // Log the URL being fetched
    console.log(`Fetching quotes from: ${url}`);
    // Fetch quotes from the current page using the fetchQuotes function
    const quotes = await fetchQuotes(url);
    // Add the fetched quotes to the allQuotes array
    allQuotes.push(...quotes);
  }

  // Return the allQuotes array
  return allQuotes;
};

// Define a function to save quotes to a JSON file
const saveQuotesToFile = (quotes, fileName) => {
  // Convert the quotes array to a formatted JSON string
  const jsonData = JSON.stringify(quotes, null, 2);
  // Write the JSON string to the specified file
  fs.writeFileSync(fileName, jsonData);
  // Log a message indicating the quotes were saved to the file
  console.log(`Quotes saved to ${fileName}`);
};

// Define the main async function
const main = async () => {
  // Set the total number of pages to fetch
  const totalPages = 2; // Change this to the total number of pages you want to fetch
  // Fetch all quotes using the fetchAllQuotes function
  const allQuotes = await fetchAllQuotes(baseURL, totalPages);
  // Save the fetched quotes to the 'quotes.json' file
  saveQuotesToFile(allQuotes, "quotes.json");
};

// Call the main function
main();

The code consists of several functions:

fetchQuotes(url): An async function that fetches quotes from a single page. It uses Axios to download the page's HTML and Cheerio to parse it and extract the quotes. The selectors are tailored to the structure of the target website: before writing a function like this, inspect the page with your browser's developer tools to find the elements and attributes that hold the data you want (here, a.title elements with a data-author attribute). Cheerio's jQuery-like selectors then target exactly those elements. If the request or parsing fails, the function logs the error and returns an empty array.

fetchAllQuotes(baseURL, totalPages): An async function that iterates through multiple pages and fetches quotes from each page using the fetchQuotes function.

saveQuotesToFile(quotes, fileName): A function that saves the quotes array to a JSON file with the specified file name.

main(): An async main function that fetches quotes from the AZQuotes website and saves them to a JSON file named quotes.json.

Step 3: Run the web scraping script

Run the web scraping script using the following command:

bash
node scrape_quotes.js

After the script finishes running, you should see a quotes.json file created in your project folder with the scraped quotes.
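To give an idea of what the file looks like, the sketch below serializes one stand-in entry with the same JSON.stringify call that saveQuotesToFile uses. The quote text is a placeholder, not real scraped content, and the variable name sampleQuotes is only for illustration.

```javascript
// Placeholder data standing in for real scraped quotes (not actual site content)
const sampleQuotes = [
  { author: "Rodney Dangerfield", quote: "(placeholder quote text)" },
];

// Same serialization as saveQuotesToFile: pretty-printed with 2-space indentation
const jsonData = JSON.stringify(sampleQuotes, null, 2);
console.log(jsonData);
// Prints:
// [
//   {
//     "author": "Rodney Dangerfield",
//     "quote": "(placeholder quote text)"
//   }
// ]
```

Your quotes.json should have the same shape: an array of objects, each with an author and a quote field.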

Step 4: Load quotes from the JSON file (optional)

If you want to load the saved quotes from the JSON file into a JavaScript object later, you can create another script called load_quotes.js with the following code:

javascript
const fs = require("fs");

const loadQuotesFromFile = (fileName) => {
  const jsonData = fs.readFileSync(fileName, "utf-8");
  return JSON.parse(jsonData);
};

const main = () => {
  const quotes = loadQuotesFromFile("quotes.json");
  console.log(quotes);
};

main();

This script defines a loadQuotesFromFile function that reads the JSON data from the specified file and parses it into a JavaScript array of quote objects. The main function calls this function to load the quotes and logs them to the console.
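Note that fs.readFileSync throws if the file does not exist, and JSON.parse throws if the contents are not valid JSON. If you would rather get an empty list in those cases, one possible variant is sketched below. The name loadQuotesSafely and the temporary-directory demo are assumptions added for illustration and are not part of the original script.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical variant of loadQuotesFromFile that returns [] instead of throwing
const loadQuotesSafely = (fileName) => {
  try {
    const parsed = JSON.parse(fs.readFileSync(fileName, "utf-8"));
    // The scraper writes an array, so treat anything else as invalid
    return Array.isArray(parsed) ? parsed : [];
  } catch (error) {
    console.error(`Could not load ${fileName}: ${error.message}`);
    return [];
  }
};

// Demo in a temporary directory so no real project files are touched
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), "quotes-"));
const goodFile = path.join(tmpDir, "quotes.json");
fs.writeFileSync(
  goodFile,
  JSON.stringify([{ author: "Rodney Dangerfield", quote: "(placeholder)" }])
);

const loaded = loadQuotesSafely(goodFile);
const missing = loadQuotesSafely(path.join(tmpDir, "does_not_exist.json"));
console.log(loaded.length, missing.length); // prints "1 0"
```

Whether to swallow the error like this or let it propagate is a design choice: returning an empty array keeps a small script simple, while letting the error surface makes a missing file harder to overlook.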

Run the script with the command:

bash
node load_quotes.js

This should print the array of quote objects to the console.

That's it! You've successfully created a web scraping script to fetch quotes using Cheerio and Node.js, and saved the quotes to a JSON file for later use.