2023-04-03

Web Scraping with Cheerio

    In this tutorial we'll be fetching quotes by Rodney Dangerfield from the AZQuotes website as an example.

    Prerequisites:

    1. Basic knowledge of JavaScript and Node.js
    2. Node.js installed on your computer

    Step 1: Install required packages

    First, create a new folder for your project and navigate to it using the command line. Then, run the following command to initialize a new Node.js project:

    npm init -y
    

    Next, install Axios and Cheerio using the following command:

    npm install axios cheerio
    

    Step 2: Create a web scraping script

    Create a new file called scrape_quotes.js and paste the code below into the file.

    // Import required libraries
    const axios = require("axios"); // Used for making HTTP requests
    const cheerio = require("cheerio"); // Used for parsing HTML and manipulating DOM elements
    const fs = require("fs"); // Used for working with the file system
    
    // Set the base URL for the quotes website
    const baseURL = "https://www.azquotes.com/author/3621-Rodney_Dangerfield";
    
    // Define an async function to fetch quotes from a single page
    const fetchQuotes = async (url) => {
      try {
        // Make an HTTP GET request to the specified URL
        const response = await axios.get(url);
        // Get the HTML content from the response
        const html = response.data;
        // Load the HTML content into Cheerio
        const $ = cheerio.load(html);
        // Select all 'a.title' elements from the HTML
        const quoteElements = $("a.title");
        // Initialize an empty array to store quotes
        const quotes = [];
    
        // Iterate through each quote element found
        quoteElements.each((index, element) => {
          // Extract the quote text and trim any whitespace
          const quoteText = $(element).text().trim();
          // Extract the 'data-author' attribute; it may be missing on some
          // elements, so fall back to an empty string before trimming
          const author = ($(element).attr("data-author") || "").trim();
    
          // Check if the author is Rodney Dangerfield
          if (author === "Rodney Dangerfield") {
            // Add the quote to the quotes array
            quotes.push({ author, quote: quoteText });
          }
        });
    
        // Return the quotes array
        return quotes;
      } catch (error) {
        // Log any errors and return an empty array
        console.error(`Error fetching the website: ${error}`);
        return [];
      }
    };
    
    // Define an async function to fetch quotes from multiple pages
    const fetchAllQuotes = async (baseURL, totalPages) => {
      // Initialize an empty array to store all quotes
      const allQuotes = [];
    
      // Iterate through each page number from 1 to totalPages
      for (let pageNumber = 1; pageNumber <= totalPages; pageNumber++) {
        // Construct the URL for the current page
        const url = `${baseURL}?p=${pageNumber}`;
        // Log the URL being fetched
        console.log(`Fetching quotes from: ${url}`);
        // Fetch quotes from the current page using the fetchQuotes function
        const quotes = await fetchQuotes(url);
        // Add the fetched quotes to the allQuotes array
        allQuotes.push(...quotes);
      }
    
      // Return the allQuotes array
      return allQuotes;
    };
    
    // Define a function to save quotes to a JSON file
    const saveQuotesToFile = (quotes, fileName) => {
      // Convert the quotes array to a formatted JSON string
      const jsonData = JSON.stringify(quotes, null, 2);
      // Write the JSON string to the specified file
      fs.writeFileSync(fileName, jsonData);
      // Log a message indicating the quotes were saved to the file
      console.log(`Quotes saved to ${fileName}`);
    };
    
    // Define the main async function
    const main = async () => {
      // Set the total number of pages to fetch
      const totalPages = 2; // Change this to the total number of pages you want to fetch
      // Fetch all quotes using the fetchAllQuotes function
      const allQuotes = await fetchAllQuotes(baseURL, totalPages);
      // Save the fetched quotes to the 'quotes.json' file
      saveQuotesToFile(allQuotes, "quotes.json");
    };
    
    // Call the main function
    main();
    

    The code consists of several functions:

    fetchQuotes(url): An async function that fetches quotes from a single page. It uses Axios to download the page's HTML and Cheerio to parse it and extract the quotes. The selectors are tailored to the structure of the target website: to write a function like this for another site, inspect the page with your browser's developer tools to find which elements and attributes hold the data you want, then target them with Cheerio's jQuery-like selectors.

    fetchAllQuotes(baseURL, totalPages): An async function that iterates through multiple pages and fetches quotes from each page using the fetchQuotes function.
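    Because fetchAllQuotes requests pages in a tight loop, you may want to pause between requests to be polite to the server. A minimal sketch using only Node's built-ins (the one-second delay is an arbitrary choice):

```javascript
// A small promise-based sleep helper; resolves after ms milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Inside the page loop in fetchAllQuotes you could then add, for example:
//   await sleep(1000); // pause one second before fetching the next page
```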

    saveQuotesToFile(quotes, fileName): A function that saves the quotes array to a JSON file with the specified file name.

    main(): An async main function that fetches quotes from the AZQuotes website and saves them to a JSON file named quotes.json.

    Step 3: Run the web scraping script

    Run the web scraping script using the following command:

    node scrape_quotes.js
    

    After the script finishes running, you should see a quotes.json file created in your project folder with the scraped quotes.
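    Given the shape of the objects pushed in fetchQuotes, the file should look roughly like this (the quote text below is a placeholder, not real output):

```json
[
  {
    "author": "Rodney Dangerfield",
    "quote": "…quote text…"
  },
  {
    "author": "Rodney Dangerfield",
    "quote": "…quote text…"
  }
]
```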

    Step 4: Load quotes from the JSON file (optional)

    If you want to load the saved quotes from the JSON file into a JavaScript object later, you can create another script called load_quotes.js with the following code:

    const fs = require("fs");
    
    const loadQuotesFromFile = (fileName) => {
      const jsonData = fs.readFileSync(fileName, "utf-8");
      return JSON.parse(jsonData);
    };
    
    const main = () => {
      const quotes = loadQuotesFromFile("quotes.json");
      console.log(quotes);
    };
    
    main();
    

    This script defines a loadQuotesFromFile function that reads the JSON data from the specified file and parses it into a JavaScript object. The main function calls this function to load the quotes and logs them to the console.
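    Once loaded, the quotes are ordinary JavaScript objects, so you can work with them using standard array methods. For example, here is a small hypothetical helper (not part of the tutorial scripts) that searches the loaded quotes for a keyword, shown with an in-memory sample array:

```javascript
// Hypothetical helper: case-insensitive keyword search over quote objects.
// Each entry has the { author, quote } shape produced by scrape_quotes.js.
const searchQuotes = (quotes, keyword) =>
  quotes.filter((q) => q.quote.toLowerCase().includes(keyword.toLowerCase()));

// Illustrative sample data standing in for the contents of quotes.json.
const sample = [
  { author: "Rodney Dangerfield", quote: "I get no respect." },
  { author: "Rodney Dangerfield", quote: "My wife made me join a bridge club." },
];
console.log(searchQuotes(sample, "respect"));
```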

    Run the script with the command:

    node load_quotes.js
    

    This should output the quotes as a JavaScript object in the console.

    That's it! You've successfully created a web scraping script to fetch quotes using Cheerio and Node.js, and saved the quotes to a JSON file for later use.


    Thanks for reading. If you enjoyed this post, I invite you to explore more of my site. I write about web development, programming, and other fun stuff.