Raluca Penciuc | Mar 3, 2023 | 8 min read

How to Web Scrape Yelp.com (2023 Update) - A Step-by-Step Guide

Environment setup

Before we begin, let's ensure we have the necessary tools.

First, download and install Node.js from the official website, making sure to use the Long-Term Support (LTS) version. This will also automatically install Node Package Manager (NPM) which we will use to install further dependencies.

For this tutorial, we will be using Visual Studio Code as our Integrated Development Environment (IDE) but you can use any other IDE of your choice. Create a new folder for your project, open the terminal, and run the following command to set up a new Node.js project:

npm init -y

This will create a package.json file in your project directory, which will store information about your project and its dependencies.

Next, we need to install TypeScript and the type definitions for Node.js. TypeScript offers optional static typing which helps prevent errors in the code. To do this, run in the terminal:

npm install typescript @types/node --save-dev

You can verify the installation by running:

npx tsc --version

TypeScript uses a configuration file called tsconfig.json to store compiler options and other settings. To create this file in your project, run the following command:

npx tsc --init

In the generated file, make sure that the value for “outDir” is set to “dist” (uncomment the option if it is commented out). This way we will separate the TypeScript files from the compiled ones. You can find more information about this file and its properties in the official TypeScript documentation.

Now, create an “src” directory in your project, and a new “index.ts” file inside it. This is where we will keep the scraping code. To execute TypeScript code you have to compile it first, so to make sure that we don’t forget this extra step, we can use a custom-defined command.

Head over to the “package.json” file, and edit the “scripts” section like this:

"scripts": {
    "test": "npx tsc && node dist/index.js"
}

This way, when you want to execute the script, you just have to type “npm run test” in your terminal.

Finally, to scrape the data from the website, we will use Puppeteer, a headless browser library for Node.js that allows you to control a web browser and interact with websites programmatically. To install it, run this command in the terminal:

npm install puppeteer

Puppeteer is highly recommended when you want to ensure the completeness of your data, as many websites today contain dynamically generated content. If you’re curious, you can check out the Puppeteer documentation before continuing to see everything it’s capable of.

Data location

Now that you have your environment set up, we can start looking at extracting the data. For this article, I chose to scrape the page of an Irish restaurant from Dublin: https://www.yelp.ie/biz/the-boxty-house-dublin?osq=Restaurants.

We’re going to extract the following data:

  • the restaurant name;
  • the restaurant rating;
  • the restaurant's number of reviews;
  • the business website;
  • the business phone number;
  • the restaurant's physical address.

You can see all this information highlighted in the screenshot below:

Yelp business page with highlighted areas for the restaurant name, rating, and contact information

By opening the Developer Tools on each of these elements, you will be able to see the CSS selectors that we will use to locate the HTML elements. If you’re fairly new to how CSS selectors work, feel free to refer to this beginner guide.

Extracting the data

Before writing our script, let’s verify that the Puppeteer installation went all right:

import puppeteer from 'puppeteer';

async function scrapeYelpData(yelp_url: string): Promise<void> {
    // Launch Puppeteer
    const browser = await puppeteer.launch({
        headless: false,
        args: ['--start-maximized'],
        defaultViewport: null
    })

    // Create a new page
    const page = await browser.newPage()

    // Navigate to the target URL
    await page.goto(yelp_url)

    // Close the browser
    await browser.close()
}

scrapeYelpData("https://www.yelp.ie/biz/the-boxty-house-dublin?osq=Restaurants")

Here we open a browser window, create a new page, navigate to our target URL, and close the browser. For the sake of simplicity and visual debugging, I open the browser window maximized in non-headless mode.

Now, let’s take a look at the website’s structure:

Yelp business page with browser devtools highlighting the HTML for the listing title and star rating

It seems that Yelp displays a somewhat difficult page structure, as the class names are randomly generated and very few elements have unique attribute values.

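But fear not, we can get creative with the solution. Firstly, to get the restaurant name, we target the only “h1” element present on the page. The snippets in this section go inside the “scrapeYelpData” function, after the call to “page.goto” and before the browser is closed.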

// Extract restaurant name
const restaurant_name = await page.evaluate(() => {
    const name = document.querySelector('h1')
    return name ? name.textContent : ''
})

console.log(restaurant_name)

Now, to get the restaurant rating, notice that, beyond the star icons, the explicit value is present in the “aria-label” attribute. So, we target the “div” element whose “aria-label” attribute ends with the “star rating” string.

// Extract restaurant rating
const restaurant_rating = await page.evaluate(() => {
    const rating = document.querySelector('div[aria-label$="star rating"]')
    return rating ? rating.getAttribute('aria-label') : ''
})

console.log(restaurant_rating)
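
The value extracted this way is still a label such as “4.5 star rating” rather than a number. Here is a minimal sketch of converting it, assuming the label always starts with the numeric value; “parseRating” is a hypothetical helper written for this example, not part of Puppeteer:

```typescript
// Hypothetical helper: turn an aria-label such as "4.5 star rating" into a number.
// Returns null when the label is missing or does not have the expected shape.
function parseRating(label: string | null): number | null {
    if (!label) return null
    // Capture the leading number, with an optional decimal part
    const match = label.trim().match(/^(\d+(?:\.\d+)?)\s+star rating$/)
    return match ? Number(match[1]) : null
}

console.log(parseRating('4.5 star rating')) // 4.5
console.log(parseRating('not rated yet'))   // null
```

Returning null instead of 0 keeps “no rating found” distinguishable from a real value when the data is stored or analyzed later.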

And finally (for this particular HTML section), we see that we can easily get the review number by targeting the highlighted anchor element.

// Extract restaurant reviews
const restaurant_reviews = await page.evaluate(() => {
    const reviews = document.querySelector('a[href="#reviews"]')
    return reviews ? reviews.textContent : ''
})

console.log(restaurant_reviews)
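
The anchor text comes back as a string like “948 reviews”. If you need the count as an integer, a small helper along these lines could do it. This is only a sketch: “parseReviewCount” is a hypothetical name, and the comma handling assumes that larger counts use a comma as thousands separator:

```typescript
// Hypothetical helper: extract the review count from text such as "948 reviews".
// Returns null when the text contains no digits.
function parseReviewCount(text: string | null): number | null {
    if (!text) return null
    // Grab the first run of digits, allowing separators like "1,204" (assumed format)
    const match = text.match(/\d[\d,]*/)
    if (!match) return null
    // Drop the separators before converting to an integer
    return Number.parseInt(match[0].replace(/,/g, ''), 10)
}

console.log(parseReviewCount('948 reviews')) // 948
```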

Easy peasy. Let’s take a look at the business information widget:

Yelp contact card with highlighted website URL, phone number, and directions, alongside devtools HTML view

Unfortunately, in this situation, we cannot rely on CSS selectors. Luckily, we can make use of another method to locate the HTML elements: XPath. If you’re fairly new to how XPath works, feel free to refer to this beginner guide.

To extract the restaurant’s website, we apply the following logic:

  • locate the “p” element that has “Business website” as text content;
  • locate its following sibling “p” element;
  • locate the anchor element inside it and read its “href” attribute.

// Extract restaurant website
const restaurant_website_element = await page.$x("//p[contains(text(), 'Business website')]/following-sibling::p/a/@href")

const restaurant_website = await page.evaluate(
    element => element.nodeValue,
    restaurant_website_element[0]
)

console.log(restaurant_website)

Now, for the phone number and the address we can follow the exact same logic, with 2 exceptions:

  • for the phone number, we stop at the following sibling and extract its textContent property;
  • for the address, we target the following sibling of the parent element.

// Extract restaurant phone number
const restaurant_phone_element = await page.$x("//p[contains(text(), 'Phone number')]/following-sibling::p")

const restaurant_phone = await page.evaluate(
    element => element.textContent,
    restaurant_phone_element[0]
)

console.log(restaurant_phone)

// Extract restaurant address
const restaurant_address_element = await page.$x("//a[contains(text(), 'Get Directions')]/parent::p/following-sibling::p")

const restaurant_address = await page.evaluate(
    element => element.textContent,
    restaurant_address_element[0]
)

console.log(restaurant_address)
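
Each XPath snippet reads index 0 of the returned array. If a query matches nothing (for instance, a listing without a website), that entry is undefined and reading a property from it fails. One possible guard is sketched below. “readFirst” is a hypothetical generic helper, demonstrated on plain arrays so it runs without a browser:

```typescript
// Hypothetical helper: read a value from the first match, or return null
// when the query matched nothing, instead of failing on an undefined entry.
async function readFirst<T, R>(
    matches: T[],
    read: (match: T) => R | Promise<R>
): Promise<R | null> {
    if (matches.length === 0) return null
    return read(matches[0])
}

async function demo(): Promise<void> {
    // One match: the reader runs on the first element
    console.log(await readFirst(['(01) 677 2762'], (text) => text.trim()))
    // No match: the reader is skipped and null is returned
    console.log(await readFirst([] as string[], (text) => text.trim()))
}

demo()
```

In the scraper, the first argument would be the array returned by the XPath query and the reader would wrap the “page.evaluate” call, so a missing field becomes null rather than an exception.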

The final result should look like this:

The Boxty House

4.5 star rating

948 reviews

/biz_redir?url=http%3A%2F%2Fwww.boxtyhouse.ie%2F&cachebuster=1673542348&website_link_type=website&src_bizid=EoMjdtjMgm3sTv7dwmfHsg&s=16fbda8bbdc467c9f3896a2dcab12f2387c27793c70f0b739f349828e3eeecc3

(01) 677 2762

20-21 Temple Bar Dublin 2
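
Notice that the website value is a relative Yelp redirect link, with the real address percent-encoded in its “url” query parameter. A minimal sketch of recovering it with the standard URL API follows. “resolveBusinessWebsite” is a hypothetical helper, and the base URL is an assumption needed only to parse the relative path:

```typescript
// Hypothetical helper: recover the real website from a "/biz_redir?url=..." link.
// Returns null when the link cannot be parsed or has no "url" parameter.
function resolveBusinessWebsite(href: string, base: string = 'https://www.yelp.ie'): string | null {
    try {
        // The scraped href is relative, so resolve it against the assumed base
        const redirect = new URL(href, base)
        // searchParams.get() already percent-decodes the value
        return redirect.searchParams.get('url')
    } catch {
        return null
    }
}

const scraped_href = '/biz_redir?url=http%3A%2F%2Fwww.boxtyhouse.ie%2F&cachebuster=1673542348&website_link_type=website'
console.log(resolveBusinessWebsite(scraped_href)) // http://www.boxtyhouse.ie/
```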

Bypass bot detection

While scraping Yelp may seem easy at first, the process can become more complex and challenging as you scale up your project. The website implements various techniques to detect and prevent automated traffic, so a scaled-up scraper is likely to start getting blocked.

Yelp collects several pieces of browser data to generate a unique fingerprint and associate it with you. Some of these are:

  • properties from the Navigator object (deviceMemory, hardwareConcurrency, platform, userAgent, webdriver, etc.)
  • timing and performance checks
  • service workers
  • screen dimensions checks
  • and many more
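
To illustrate why these properties matter, here is a rough sketch of how a site could combine such signals into a single identifier. This is not Yelp's actual algorithm, which is not public: the “BrowserSignals” shape, the FNV-1a hash, and the sample values are all assumptions chosen for illustration.

```typescript
// Hypothetical subset of the signals a site could read from the browser
interface BrowserSignals {
    userAgent: string
    platform: string
    hardwareConcurrency: number
    deviceMemory: number
    webdriver: boolean
    screen: { width: number; height: number }
}

// 32-bit FNV-1a hash over UTF-16 code units, used only as a simple stand-in
function fnv1a(input: string): string {
    let hash = 0x811c9dc5
    for (let i = 0; i < input.length; i++) {
        hash ^= input.charCodeAt(i)
        hash = Math.imul(hash, 0x01000193)
    }
    // Convert to an unsigned value before formatting as hexadecimal
    return (hash >>> 0).toString(16)
}

// Serialize the fields in a fixed order so equal signals give equal hashes
function fingerprint(signals: BrowserSignals): string {
    const parts = [
        signals.userAgent,
        signals.platform,
        signals.hardwareConcurrency,
        signals.deviceMemory,
        signals.webdriver,
        signals.screen.width,
        signals.screen.height,
    ]
    return fnv1a(parts.join('|'))
}

// Made-up sample values for a regular visitor
const visitor: BrowserSignals = {
    userAgent: 'Mozilla/5.0 (example)',
    platform: 'Win32',
    hardwareConcurrency: 8,
    deviceMemory: 8,
    webdriver: false,
    screen: { width: 1920, height: 1080 },
}

console.log(fingerprint(visitor))
console.log(fingerprint({ ...visitor, webdriver: true }))
```

Identical signals always produce the same identifier, which is how repeated requests from one scraper can be linked together, while a single differing property (here the webdriver flag, which automated browsers typically report as true) changes it.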

One way to overcome these challenges and continue scraping at a large scale is to use a scraping API. These kinds of services provide a simple and reliable way to access data from websites like yelp.com, without the need to build and maintain your own scraper.

WebScrapingAPI is an example of such a product. Its proxy rotation mechanism is designed to avoid CAPTCHAs, and its extended knowledge base makes it possible to randomize the browser data so that requests look like they come from a real user.

The setup is quick and easy. All you need to do is register an account to receive your API key. The key can be accessed from your dashboard, and it’s used to authenticate the requests you send.

Dashboard quickstart guide showing three steps: API access key, API Playground, and integration into your application

As you have already set up your Node.js environment, we can make use of the corresponding SDK. Run the following command to add it to your project dependencies:

npm install webscrapingapi

Now all that’s left to do is to send a GET request so we receive the website’s HTML document. Note that this is not the only way you can access the API.

import webScrapingApiClient from 'webscrapingapi';

const client = new webScrapingApiClient("YOUR_API_KEY");

async function exampleUsage() {
    const api_params = {
        'render_js': 1,
        'proxy_type': 'residential',
    }

    const URL = "https://www.yelp.ie/biz/the-boxty-house-dublin?osq=Restaurants"

    const response = await client.get(URL, api_params)

    if (response.success) {
        console.log(response.response.data)
    } else {
        console.log(response.error.response.data)
    }
}

exampleUsage();

By enabling the “render_js” parameter, we send the request using a headless browser, just like you did earlier in this tutorial.

After receiving the HTML document, you can use another library to extract the data of interest, like Cheerio. Never heard of it? Check out this guide to help you get started!

Conclusion

This article has presented you with a comprehensive guide on how to web scrape Yelp using TypeScript and Puppeteer. We have gone through the process of setting up the environment, locating and extracting the data, and seen why using a professional scraper can be a better solution than creating your own.

The data scraped from Yelp can be used for various purposes such as identifying market trends, analyzing customer sentiment, monitoring competitors, creating targeted marketing campaigns, and many more.

Overall, web scraping Yelp.com can be a valuable asset for anyone looking to gain a competitive advantage in their local market and this guide has provided a great starting point to do so.

About the Author
Raluca Penciuc, Full-Stack Developer @ WebScrapingAPI

Raluca Penciuc is a Full Stack Developer at WebScrapingAPI, building scrapers, improving evasions, and finding reliable ways to reduce detection across target websites.
