2024-09-09 web, development, javascript
Web Scraping with Cheerio
By O. Wolfson
In this tutorial, we'll use Axios and Cheerio to scrape quotes by Rodney Dangerfield from the AZQuotes website as an example.
Prerequisites:
- Basic knowledge of JavaScript and Node.js
- Node.js installed on your computer
Step 1: Install required packages
First, create a new folder for your project and navigate to it using the command line. Then, run the following command to initialize a new Node.js project:
```bash
npm init -y
```
Next, install Axios and Cheerio using the following command:
```bash
npm install axios cheerio
```
Step 2: Create a web scraping script
Create a new file called scrape_quotes.js and paste the code below into the file.
```javascript
// Import required libraries
const axios = require("axios"); // Used for making HTTP requests
const cheerio = require("cheerio"); // Used for parsing HTML and querying DOM elements
const fs = require("fs"); // Used for working with the file system

// Set the base URL for the quotes website
const baseURL = "https://www.azquotes.com/author/3621-Rodney_Dangerfield";

// Define an async function to fetch quotes from a single page
const fetchQuotes = async (url) => {
  try {
    // Make an HTTP GET request to the specified URL
    const response = await axios.get(url);

    // Get the HTML content from the response
    const html = response.data;

    // Load the HTML content into Cheerio
    const $ = cheerio.load(html);

    // Select all 'a.title' elements from the HTML
    const quoteElements = $("a.title");

    // Initialize an empty array to store quotes
    const quotes = [];

    // Iterate through each quote element found
    quoteElements.each((index, element) => {
      // Extract the quote text and trim any whitespace
      const quoteText = $(element).text().trim();

      // Extract the 'data-author' attribute; fall back to an empty string
      // so an element missing the attribute doesn't crash the script
      const author = ($(element).attr("data-author") || "").trim();

      // Check if the author is Rodney Dangerfield
      if (author === "Rodney Dangerfield") {
        // Add the quote to the quotes array
        quotes.push({ author, quote: quoteText });
      }
    });

    // Return the quotes array
    return quotes;
  } catch (error) {
    // Log any errors and return an empty array
    console.error(`Error fetching the website: ${error}`);
    return [];
  }
};

// Define an async function to fetch quotes from multiple pages
const fetchAllQuotes = async (baseURL, totalPages) => {
  // Initialize an empty array to store all quotes
  const allQuotes = [];

  // Iterate through each page number from 1 to totalPages
  for (let pageNumber = 1; pageNumber <= totalPages; pageNumber++) {
    // Construct the URL for the current page
    const url = `${baseURL}?p=${pageNumber}`;

    // Log the URL being fetched
    console.log(`Fetching quotes from: ${url}`);

    // Fetch quotes from the current page using the fetchQuotes function
    const quotes = await fetchQuotes(url);

    // Add the fetched quotes to the allQuotes array
    allQuotes.push(...quotes);
  }

  // Return the allQuotes array
  return allQuotes;
};

// Define a function to save quotes to a JSON file
const saveQuotesToFile = (quotes, fileName) => {
  // Convert the quotes array to a formatted JSON string
  const jsonData = JSON.stringify(quotes, null, 2);

  // Write the JSON string to the specified file
  fs.writeFileSync(fileName, jsonData);

  // Log a message indicating the quotes were saved to the file
  console.log(`Quotes saved to ${fileName}`);
};

// Define the main async function
const main = async () => {
  // Set the total number of pages to fetch
  const totalPages = 2; // Change this to the total number of pages you want to fetch

  // Fetch all quotes using the fetchAllQuotes function
  const allQuotes = await fetchAllQuotes(baseURL, totalPages);

  // Save the fetched quotes to the 'quotes.json' file
  saveQuotesToFile(allQuotes, "quotes.json");
};

// Call the main function
main();
```
The code consists of several functions:
- `fetchQuotes(url)`: An async function that fetches quotes from a single page. It uses Axios to download the page's HTML and Cheerio to parse it and extract the quotes with jQuery-like selectors. Writing a function like this requires inspecting the target site's HTML structure and element attributes with your browser's developer tools; the selectors used here (`a.title`, `data-author`) are tailored to AZQuotes' markup.
- `fetchAllQuotes(baseURL, totalPages)`: An async function that iterates through multiple pages, calling `fetchQuotes` on each page and collecting the results.
- `saveQuotesToFile(quotes, fileName)`: A function that saves the quotes array to a JSON file with the specified file name.
- `main()`: The async entry point that fetches the quotes from the AZQuotes website and saves them to a JSON file named quotes.json.
Step 3: Run the web scraping script
Run the web scraping script using the following command:
```bash
node scrape_quotes.js
```
After the script finishes running, you should see a quotes.json file created in your project folder with the scraped quotes.
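The exact contents depend on what the site serves when you run the script, but the file will have this shape (the entries below are illustrative):

```json
[
  {
    "author": "Rodney Dangerfield",
    "quote": "I get no respect."
  },
  {
    "author": "Rodney Dangerfield",
    "quote": "My wife and I were happy for twenty years. Then we met."
  }
]
```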
Step 4: Load quotes from the JSON file (optional)
If you want to load the saved quotes from the JSON file into a JavaScript object later, you can create another script called load_quotes.js with the following code:
```javascript
const fs = require("fs");

const loadQuotesFromFile = (fileName) => {
  const jsonData = fs.readFileSync(fileName, "utf-8");
  return JSON.parse(jsonData);
};

const main = () => {
  const quotes = loadQuotesFromFile("quotes.json");
  console.log(quotes);
};

main();
```
This script defines a loadQuotesFromFile function that reads the JSON data from the specified file and parses it into a JavaScript object. The main function calls this function to load the quotes and logs them to the console.
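Once loaded, the quotes are plain JavaScript objects, so you can filter, count, or sample them like any other array. The sketch below stands in for the result of `loadQuotesFromFile("quotes.json")` with a hard-coded array so it runs on its own:

```javascript
// Illustrative stand-in for the array returned by loadQuotesFromFile("quotes.json")
const quotes = [
  { author: "Rodney Dangerfield", quote: "I get no respect." },
  { author: "Rodney Dangerfield", quote: "My wife and I were happy for twenty years. Then we met." },
];

// Pick a random quote from the array
const random = quotes[Math.floor(Math.random() * quotes.length)];
console.log(random.quote);

// Count how many quotes each author has
const counts = quotes.reduce((acc, q) => {
  acc[q.author] = (acc[q.author] || 0) + 1;
  return acc;
}, {});
console.log(counts);
```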
Run the script with the command:
```bash
node load_quotes.js
```
This should print the quotes as an array of JavaScript objects in the console.
That's it! You've successfully created a web scraping script to fetch quotes using Cheerio and Node.js, and saved the quotes to a JSON file for later use.