Trustpilot Review Scraper

Trustpilot Review Scraper is a Python-based tool for scraping and analyzing detailed reviews from Trustpilot.com. It is run through the trustpilot.py script.

With trustpilot.py, you can extract and analyze reviews for a specific domain on Trustpilot, filtering by star rating, date range, language, and more. The tool also supports sorting, visualization, and data export for further analysis.

Features

🔍 Comprehensive Data Collection: Extract all available data for Trustpilot reviews, including ratings, titles, content, dates, and more.
🌍 Multi-Language Support: Fetch reviews in all available languages, allowing for global data analysis.
📄 Multi-Page Scraping: Navigates and collects data from multiple pages automatically.
💾 Flexible Output Options: Offers CSV, JSON, or both formats for output, catering to various data processing needs.
🚿 Customizable Filtering: Filter reviews by star ratings, date ranges, search keywords, languages, and more.
🔀 Sorting Capabilities: Sort reviews by various fields, such as rating, date, relevance, etc.
💻 Easy-to-Use Command-Line Interface: Simple and intuitive command-line interface for easy data collection.
📊 Advanced Analytics: Analyze keyword correlations with ratings and visualize review trends.
🌎 Geographic Analysis: Group and visualize reviews by location with detailed charts.
🔄 Retry Logic: Built-in retry mechanism for handling slow-loading or dynamic pages.

Getting Started

Prerequisites

Tested with Python 3.12.1. Earlier versions may not be compatible.
Required Dependencies:
- Python libraries listed in requirements.txt:
  - pandas
  - requests
  - matplotlib
  - lxml

Installation

Clone the Repository

Start by cloning the repository to your local machine:

git clone https://github.com/robertvy/TrustpilotReviewScraper.git
cd TrustpilotReviewScraper

Set Up a Virtual Environment

Creating a virtual environment is recommended to avoid conflicts with other Python projects:

Unix/macOS:

python3 -m venv env
source env/bin/activate

Windows:

python -m venv env
.\env\Scripts\activate

Install Required Packages

Install all the necessary packages using pip:

pip install -r requirements.txt

Usage

Running the `trustpilot.py` Script

To scrape reviews for a specific domain on Trustpilot, use the following command, replacing [domain] with the actual domain you wish to scrape:

python trustpilot.py [domain]

Required Argument

domain: The domain to scrape reviews for. It is the first positional argument and takes no flag.

Optional Arguments

--stars [N ...]: Filter reviews by star ratings. Accepts multiple values. For instance, --stars 4 5 will only fetch reviews rated with 4 or 5 stars.
--date: Filter reviews by date. Options include 'last30days', 'last3months', 'last6months'. Example usage: --date last30days.
--search: Filter reviews by a search keyword. Example usage: --search "excellent service".
--languages: Specify the language of reviews to fetch (e.g., 'en' for English, 'es' for Spanish). The default is 'all', which collects reviews in all languages. Example usage: --languages en.
--verified: If used, the script will only fetch reviews that are verified. No additional value needs to be specified. Example usage: --verified.
--replies: If used, the script will only fetch reviews that have replies. No additional value needs to be specified. Example usage: --replies.
--sort-by: Specify the field to sort reviews by. Options include 'rating', 'date', etc. The default is 'published_date'. Example usage: --sort-by rating.
--sort-order: Specify the order to sort reviews in. Options are 'asc' for ascending and 'desc' for descending. The default is 'asc'. Example usage: --sort-order desc.
--output: Choose the output format of the scraped data. Options are 'csv', 'json', or 'both'. The default is 'csv'. Example usage: --output both.
--analyze: Perform keyword analysis and correlation with ratings. Example usage: --analyze.
--visualize: Generate visualizations of review trends by location. Creates charts in a 'charts' directory. Example usage: --visualize.
--retry: Enable retry logic for handling slow-loading or dynamic pages. Example usage: --retry.
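
The options above map naturally onto Python's argparse module. The following is a minimal sketch of such a parser, not the script's actual implementation: the function name build_parser, the 1 to 5 range for --stars, and the argument types are assumptions made for illustration, while the flag names, choices, and defaults follow the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Scrape Trustpilot reviews for a domain."
    )
    # Required positional argument.
    parser.add_argument("domain", help="Domain to scrape reviews for")
    # Filters. The 1-5 star range is an assumption.
    parser.add_argument("--stars", nargs="+", type=int, choices=range(1, 6),
                        metavar="N", help="Star ratings to include")
    parser.add_argument("--date",
                        choices=["last30days", "last3months", "last6months"])
    parser.add_argument("--search", help="Search keyword")
    parser.add_argument("--languages", default="all")
    parser.add_argument("--verified", action="store_true")
    parser.add_argument("--replies", action="store_true")
    # Sorting and output.
    parser.add_argument("--sort-by", default="published_date")
    parser.add_argument("--sort-order", choices=["asc", "desc"], default="asc")
    parser.add_argument("--output", choices=["csv", "json", "both"],
                        default="csv")
    # Analytics and robustness switches.
    parser.add_argument("--analyze", action="store_true")
    parser.add_argument("--visualize", action="store_true")
    parser.add_argument("--retry", action="store_true")
    return parser


if __name__ == "__main__":
    # Parse an explicit argument list instead of sys.argv for demonstration.
    args = build_parser().parse_args(
        ["example.com", "--stars", "4", "5", "--output", "both"]
    )
    print(args.domain, args.stars, args.sort_by, args.output)
```

Note that argparse exposes --sort-by and --sort-order as args.sort_by and args.sort_order, and that flags declared with store_true default to False when omitted.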

Analytics Features

Keyword Analysis

When using the --analyze flag, the script will:

Calculate correlations between keywords and ratings
Generate a statistical significance report
Show strength indicators for correlations
Output results showing positive and negative associations

Example output:

Keyword correlation analysis:
(negative values indicate lower ratings)
----------------------------------------
excellent  :  0.412 (+) ***
terrible   : -0.523 (!) **

Where:

***, **, * indicate significance levels
(!) indicates strong correlation
(+) indicates moderate correlation
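
One common way to obtain a score of this kind is the Pearson correlation between a 0/1 indicator of keyword presence and the review rating. The sketch below shows that calculation only. It is an assumption about the general approach rather than a description of what trustpilot.py does, it does not compute significance levels or strength indicators, and the function name keyword_rating_correlation is hypothetical.

```python
from math import sqrt


def keyword_rating_correlation(reviews, keyword):
    """Pearson correlation between keyword presence (0/1) and rating.

    `reviews` is a list of dicts with "text" and "rating" keys (assumed shape).
    Returns None when the correlation is undefined, for example when the
    keyword appears in every review or in none.
    """
    n = len(reviews)
    if n < 2:
        return None
    needle = keyword.lower()
    flags = [1.0 if needle in r["text"].lower() else 0.0 for r in reviews]
    ratings = [float(r["rating"]) for r in reviews]
    mean_flag = sum(flags) / n
    mean_rating = sum(ratings) / n
    # Sums of cross-products and squared deviations (the 1/n factors cancel).
    cov = sum((f - mean_flag) * (x - mean_rating) for f, x in zip(flags, ratings))
    var_flag = sum((f - mean_flag) ** 2 for f in flags)
    var_rating = sum((x - mean_rating) ** 2 for x in ratings)
    if var_flag == 0 or var_rating == 0:
        return None
    return cov / sqrt(var_flag * var_rating)


if __name__ == "__main__":
    sample = [
        {"text": "Excellent service", "rating": 5},
        {"text": "Terrible support", "rating": 1},
        {"text": "Excellent and fast", "rating": 4},
        {"text": "It was okay", "rating": 3},
    ]
    for word in ("excellent", "terrible"):
        print(word, keyword_rating_correlation(sample, word))
```

A positive value means reviews containing the keyword tend to have higher ratings, and a negative value means they tend to have lower ratings.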

Geographic Visualization

When using the --visualize flag, the script will:

Generate charts in a 'charts' directory
Create bar charts showing:
- Number of reviews by country
- Average ratings by country
Sort data for better visualization
Save as PNG files for easy sharing
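
The aggregation behind these charts can be sketched as follows. This is an illustrative outline under stated assumptions, not the script's own code: the function names are hypothetical, and the input is assumed to be a list of dicts with the country_code and rating fields shown in the JSON output. The plotting helper needs matplotlib, which is listed in requirements.txt.

```python
from collections import defaultdict


def summarize_by_country(reviews):
    """Return {country_code: (review_count, average_rating)}.

    Entries are ordered by review count, largest first, so charts built
    from the result are already sorted.
    """
    ratings_by_country = defaultdict(list)
    for review in reviews:
        ratings_by_country[review["country_code"]].append(review["rating"])
    summary = {
        country: (len(ratings), sum(ratings) / len(ratings))
        for country, ratings in ratings_by_country.items()
    }
    return dict(sorted(summary.items(), key=lambda item: item[1][0], reverse=True))


def save_bar_chart(labels, values, ylabel, path):
    """Save a simple bar chart as a PNG file (requires matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(labels, values)
    ax.set_ylabel(ylabel)
    fig.savefig(path)
    plt.close(fig)
```

To mirror the documented output, the per-country counts and the per-country averages could each be passed to save_bar_chart with a path inside a charts directory, created beforehand with os.makedirs("charts", exist_ok=True).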

Example Usage Commands and Outputs

Below are examples of how to use the script along with a description of expected outputs:

Fetch reviews for domain example.com with only 5-star ratings, sorted by rating in descending order, and output to JSON:

python trustpilot.py example.com --stars 5 --sort-by rating --sort-order desc --output json

Expected output:

A JSON file reviews_example.com_[timestamp].json will be created with content structured as follows:

[
   {
      "id": "123abc...",
      "filtered": false,
      "pending": false,
      "text": "The service was amazing. Highly recommended.",
      "rating": 5,
      "title": "Great service!",
      "likes": 0,
      "language": "en",
      "location": "London, United Kingdom",
      "published_date": "2023-10-01T14:30:00+00:00",
      "display_name": "John D.",
      "country_code": "GB",
      "consumer_verified": true,
      "review_verified": true,
      "reply_message": null,
      // ... additional fields
   },
   // ... more reviews
]
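
A file with this structure can be loaded for further processing using the standard library. The sketch below assumes the file on disk is valid JSON (the // lines above are only placeholders in this README) and that published_date is an ISO 8601 string as in the sample. The function names parse_reviews, load_reviews, and reviews_with_rating are hypothetical helpers, not part of trustpilot.py.

```python
import json
from datetime import datetime


def parse_reviews(raw_json):
    """Parse a JSON array of reviews and convert published_date to datetime."""
    reviews = json.loads(raw_json)
    for review in reviews:
        review["published_date"] = datetime.fromisoformat(review["published_date"])
    return reviews


def load_reviews(path):
    """Read an exported JSON file and parse it with parse_reviews."""
    with open(path, encoding="utf-8") as handle:
        return parse_reviews(handle.read())


def reviews_with_rating(reviews, rating):
    """Return only the reviews that have the given star rating."""
    return [review for review in reviews if review["rating"] == rating]


if __name__ == "__main__":
    raw = json.dumps([
        {"id": "a", "rating": 5, "published_date": "2023-10-01T14:30:00+00:00"},
        {"id": "b", "rating": 3, "published_date": "2023-09-25T09:00:00+00:00"},
    ])
    five_star = reviews_with_rating(parse_reviews(raw), 5)
    print([review["id"] for review in five_star])
```

Converting published_date to a datetime object makes it possible to sort or bucket reviews by time without string comparisons.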

Fetch reviews for domain example.com over the last 30 days with replies, save as both CSV and JSON:

python trustpilot.py example.com --date last30days --replies --output both

Expected outputs:

CSV file reviews_example.csv and JSON file reviews_example.json containing the requested review data.

Archived Outputs

When the script processes reviews, output files are stored locally in the working directory based on the selected format.

CSV File Structure: Example CSV structure for scraped reviews:

Title	Content	Rating	Date	Verified	Language
Great service!	The service was amazing...	5	2023-10-01	True	en
Average	It was okay, not great...	3	2023-09-25	True	en

Keyword Analysis File: A text analysis report can include the frequency of keywords across reviews, saved in a .txt or .json file:

Keyword Analysis:
- "amazing": 15 occurrences
- "service": 27 occurrences
- "highly recommended": 10 occurrences
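
Counts like these can be produced with a few lines of standard-library Python. The following is a minimal sketch assuming case-insensitive, whole-word matching. How trustpilot.py actually tokenizes and counts keywords is not stated here, and keyword_frequencies is a hypothetical name.

```python
import re
from collections import Counter


def keyword_frequencies(texts, keywords):
    """Count case-insensitive whole-word occurrences of each keyword or phrase."""
    counts = Counter()
    for keyword in keywords:
        # \b anchors keep "service" from matching inside "services".
        pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        counts[keyword] = sum(len(pattern.findall(text)) for text in texts)
    return counts


if __name__ == "__main__":
    texts = [
        "The service was amazing. Highly recommended.",
        "Amazing service, great service!",
    ]
    counts = keyword_frequencies(texts, ["amazing", "service", "highly recommended"])
    for keyword, count in counts.items():
        print(f'- "{keyword}": {count} occurrences')
```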

Visual Outputs: The script can generate bar charts for review ratings:

Example chart:

Bar chart showing distribution of ratings from 1 to 5.

Known Issues

Currently, the script does not fetch multiple reviews submitted by the same reviewer.

Contributing

Contributions are welcome! Feel free to fork the repository, make changes, and submit pull requests. If you have any suggestions or issues, please open an issue in the GitHub repository.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Disclaimer

Important Notes on Ethical Scraping

Always check and adhere to Trustpilot's robots.txt file.
Use this tool responsibly to avoid violating terms of service or impacting website performance.
This tool is for educational and research purposes only.
