[Self-Host] call to playwright is failing #902

Open
rostwal95 opened this issue Nov 15, 2024 · 12 comments

@rostwal95

Describe the Issue
The call to the Playwright service fails when scraping with the playwright engine.

To Reproduce
Steps to reproduce the issue:

  1. Configure the environment or settings with '...'
  2. Run the command '...'
  3. Observe the error or unexpected output at '...'
  4. Log output/error message

Expected Behavior
The call to Playwright should succeed, and dynamic JavaScript content should be rendered and cleaned up.

Screenshots
If applicable, add screenshots or copies of the command line output to help explain the self-hosting issue.

Environment (please complete the following information):

  • OS: [e.g. macOS, Linux, Windows]
  • Firecrawl Version: [e.g. 1.2.3]
  • Node.js Version: [e.g. 14.x]
  • Docker Version (if applicable): [e.g. 20.10.14]
  • Database Type and Version: [e.g. PostgreSQL 13.4]

Logs
worker-1 | 2024-11-15 05:13:48 debug [ScrapeURL:]: Engine docx meets feature priority threshold
worker-1 | 2024-11-15 05:13:48 info [ScrapeURL:]: Scraping via playwright...
worker-1 | 2024-11-15 05:13:48 debug [ScrapeURL:scrapeURLWithPlaywright]: Sending request...
worker-1 | 2024-11-15 05:13:48 debug [ScrapeURL:scrapeURLWithPlaywright]: Request sent failure status
worker-1 | 2024-11-15 05:13:48 info [ScrapeURL:]: An unexpected error happened while scraping with playwright.
worker-1 | 2024-11-15 05:13:48 info [ScrapeURL:]: Scraping via fetch...

Here are the logs.

Configuration
Provide relevant parts of your configuration files (with sensitive information redacted).

Additional Context
Add any other context about the self-hosting issue here, such as specific infrastructure details, network setup, or any modifications made to the original Firecrawl setup.

@mogery
Member

mogery commented Nov 15, 2024

Can you share the logs of the playwright microservice as well?

@mogery mogery self-assigned this Nov 15, 2024
@mkaskov

mkaskov commented Nov 15, 2024

I have the same problem with apps/playwright-service-ts:

playwright-service-1 | SyntaxError: Unexpected token " in JSON at position 0
playwright-service-1 | at JSON.parse (<anonymous>)
playwright-service-1 | at createStrictSyntaxError (/usr/src/app/node_modules/body-parser/lib/types/json.js:169:10)
playwright-service-1 | at parse (/usr/src/app/node_modules/body-parser/lib/types/json.js:86:15)
playwright-service-1 | at /usr/src/app/node_modules/body-parser/lib/read.js:128:18
playwright-service-1 | at AsyncResource.runInAsyncScope (node:async_hooks:203:9)
playwright-service-1 | at invokeCallback (/usr/src/app/node_modules/raw-body/index.js:238:16)
playwright-service-1 | at done (/usr/src/app/node_modules/raw-body/index.js:227:7)
playwright-service-1 | at IncomingMessage.onEnd (/usr/src/app/node_modules/raw-body/index.js:287:7)
playwright-service-1 | at IncomingMessage.emit (node:events:517:28)
playwright-service-1 | at endReadableNT (node:internal/streams/readable:1400:12)

@mogery
Member

mogery commented Nov 15, 2024

I just made a change; I think the way we sent the request to the microservice was wrong. Can you rebuild firecrawl (no need to rebuild playwright-service) and try again?
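
For readers hitting the same trace: body-parser's strict JSON mode only accepts a body whose first character is `{` or `[`, so `Unexpected token " in JSON at position 0` means the body started with a double quote. One way that can happen, and this is only an assumption about what the old request code did, is serializing the payload twice. A minimal sketch (the payload fields and the helper name `looksLikeStrictJson` are hypothetical):

```typescript
// Hypothetical payload; the field names are illustrative, not Firecrawl's actual schema.
const payload = { url: "https://example.com", timeout: 15000 };

// Serialized once: a JSON object, which a strict JSON parser accepts.
const encodedOnce: string = JSON.stringify(payload);

// Serialized twice: a JSON *string* whose first character is a double quote.
const encodedTwice: string = JSON.stringify(encodedOnce);

// Rough imitation of a strict check: only objects and arrays pass.
function looksLikeStrictJson(body: string): boolean {
  const first = body.trim().charAt(0);
  return first === "{" || first === "[";
}

console.log(looksLikeStrictJson(encodedOnce));
console.log(looksLikeStrictJson(encodedTwice));
```

If the client library already serializes objects for you, passing it a pre-stringified body is a typical way to end up with the second form.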

@rostwal95
Author

I am also getting errors while building the Docker container:

=> ERROR [playwright-service 2/6] RUN apt-get update && apt-get install -y --no-install-recommends gcc libstdc++6 0.9s

[playwright-service 2/6] RUN apt-get update && apt-get install -y --no-install-recommends gcc libstdc++6:
0.539 Get:1 http://deb.debian.org/debian bookworm InRelease [151 kB]
0.645 Err:1 http://deb.debian.org/debian bookworm InRelease
0.645 At least one invalid signature was encountered.
0.648 Get:2 http://deb.debian.org/debian bookworm-updates InRelease [55.4 kB]
0.677 Err:2 http://deb.debian.org/debian bookworm-updates InRelease
0.677 At least one invalid signature was encountered.
0.693 Get:3 http://deb.debian.org/debian-security bookworm-security InRelease [48.0 kB]
0.717 Err:3 http://deb.debian.org/debian-security bookworm-security InRelease
0.717 At least one invalid signature was encountered.
0.722 Reading package lists...
0.728 W: GPG error: http://deb.debian.org/debian bookworm InRelease: At least one invalid signature was encountered.
0.728 E: The repository 'http://deb.debian.org/debian bookworm InRelease' is not signed.
0.728 W: GPG error: http://deb.debian.org/debian bookworm-updates InRelease: At least one invalid signature was encountered.
0.728 E: The repository 'http://deb.debian.org/debian bookworm-updates InRelease' is not signed.
0.728 W: GPG error: http://deb.debian.org/debian-security bookworm-security InRelease: At least one invalid signature was encountered.
0.728 E: The repository 'http://deb.debian.org/debian-security bookworm-security InRelease' is not signed.


failed to solve: process "/bin/sh -c apt-get update && apt-get install -y --no-install-recommends gcc libstdc++6" did not complete successfully: exit code: 100

@rostwal95
Author

rostwal95 commented Nov 15, 2024

I still see the issue. I'm not sure why the Playwright failure is logged at info level rather than error:

worker-1 | 2024-11-15 16:23:58 info [:]: 🐂 Worker taking job b2c3e207-55ca-4abb-8be1-57a0b1b88cd2
worker-1 | 2024-11-15 16:23:58 info [ScrapeURL:]: Scraping URL "https://www.britishairways.com/travel/home/public/en_us/"...
worker-1 | 2024-11-15 16:23:58 debug [ScrapeURL:]: Engine scrapingbee meets feature priority threshold
worker-1 | 2024-11-15 16:23:58 debug [ScrapeURL:]: Engine scrapingbeeLoad meets feature priority threshold
worker-1 | 2024-11-15 16:23:58 debug [ScrapeURL:]: Engine playwright meets feature priority threshold
worker-1 | 2024-11-15 16:23:58 debug [ScrapeURL:]: Engine fetch meets feature priority threshold
worker-1 | 2024-11-15 16:23:58 debug [ScrapeURL:]: Engine pdf meets feature priority threshold
worker-1 | 2024-11-15 16:23:58 debug [ScrapeURL:]: Engine docx meets feature priority threshold
worker-1 | 2024-11-15 16:23:58 info [ScrapeURL:]: Scraping via scrapingbee...
worker-1 | 2024-11-15 16:23:59 error [ScrapeURL:]: ScrapingBee threw an error {"module":"ScrapeURL","scrapeId":"b2c3e207-55ca-4abb-8be1-57a0b1b88cd2","method":"","engine":"scrapingbee","body":{"message":"Invalid api key: # use if you'd like to use as a fallback scraper"}}
worker-1 | 2024-11-15 16:23:59 info [ScrapeURL:]: Engine scrapingbee could not scrape the page.
worker-1 | 2024-11-15 16:23:59 info [ScrapeURL:]: Scraping via scrapingbeeLoad...
worker-1 | 2024-11-15 16:23:59 error [ScrapeURL:]: ScrapingBee threw an error {"module":"ScrapeURL","scrapeId":"b2c3e207-55ca-4abb-8be1-57a0b1b88cd2","method":"","engine":"scrapingbeeLoad","body":{"message":"Invalid api key: # use if you'd like to use as a fallback scraper"}}
worker-1 | 2024-11-15 16:23:59 info [ScrapeURL:]: Engine scrapingbeeLoad could not scrape the page.
worker-1 | 2024-11-15 16:23:59 info [ScrapeURL:]: Scraping via playwright...
worker-1 | 2024-11-15 16:23:59 debug [ScrapeURL:scrapeURLWithPlaywright]: Sending request...
worker-1 | 2024-11-15 16:23:59 debug [ScrapeURL:scrapeURLWithPlaywright]: Request failed
worker-1 | 2024-11-15 16:23:59 info [ScrapeURL:]: An unexpected error happened while scraping with playwright.

worker-1 | 2024-11-15 16:23:59 info [ScrapeURL:]: Scraping via fetch...
worker-1 | 2024-11-15 16:24:01 info [ScrapeURL:]: Scrape via fetch deemed successful.
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Executing transformer deriveHTMLFromRawHTML...
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Finished executing transformer deriveHTMLFromRawHTML (7ms)
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Executing transformer deriveMarkdownFromHTML...
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Finished executing transformer deriveMarkdownFromHTML (1ms)
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Executing transformer deriveLinksFromHTML...
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Finished executing transformer deriveLinksFromHTML (0ms)
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Executing transformer deriveMetadataFromRawHTML...
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Finished executing transformer deriveMetadataFromRawHTML (4ms)
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Executing transformer uploadScreenshot...
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Finished executing transformer uploadScreenshot (0ms)
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Executing transformer performLLMExtract...
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Finished executing transformer performLLMExtract (0ms)
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Executing transformer coerceFieldsToFormats...
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Finished executing transformer coerceFieldsToFormats (0ms)
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Executing transformer removeBase64Images...
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Finished executing transformer removeBase64Images (0ms)
worker-1 | 2024-11-15 16:24:01 info [:]: 🐂 Job done b2c3e207-55ca-4abb-8be1-57a0b1b88cd2

The response has empty markdown:

{
  "success": true,
  "data": {
    "markdown": "",
    "metadata": {
      "title": "British Airways | Book Flights, Holidays, City Breaks & Check In Online",
      "description": "Save on worldwide flights and holidays when you book directly with British Airways. Browse our guides, find great deals, manage your booking and check in online.",
      "language": "en",
      "robots": "all",
      "ogLocaleAlternate": [],
      "theme-color": "#ffffff",
      "viewport": "width=device-width, initial-scale=1",
      "sourceURL": "https://www.britishairways.com/travel/home/public/en_us/",
      "url": "https://www.britishairways.com/travel/home/public/en_us/",
      "statusCode": 200
    }
  }
}
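
Because the response reports `"success": true` even though the fetch fallback produced no markdown, a caller may want its own check. A minimal client-side sketch, assuming only the response shape shown above (the `ScrapeResponse` type and `hasUsableMarkdown` helper are hypothetical names, not part of any Firecrawl SDK):

```typescript
// Shape inferred from the response above; only the fields used here are declared.
interface ScrapeResponse {
  success: boolean;
  data?: {
    markdown?: string;
    metadata?: Record<string, unknown>;
  };
}

// Treats a "successful" scrape with empty or whitespace-only markdown as unusable.
function hasUsableMarkdown(res: ScrapeResponse): boolean {
  const markdown = res.data?.markdown ?? "";
  return res.success && markdown.trim().length > 0;
}

// The response from this comment, reduced to the relevant fields.
const emptyResult: ScrapeResponse = {
  success: true,
  data: { markdown: "", metadata: { statusCode: 200 } },
};

console.log(hasUsableMarkdown(emptyResult));
```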

@mkaskov

mkaskov commented Nov 20, 2024

Here is another error. After it happens, Firecrawl stops working correctly:

worker-1 | 2024-11-20 06:40:56 info [ScrapeURL:]: An unexpected error happened while scraping with playwright.
worker-1 | 2024-11-20 06:40:56 info [ScrapeURL:]: Scraping via fetch...
worker-1 | 2024-11-20 06:40:57 info [ScrapeURL:]: Scrape via fetch deemed successful.
worker-1 | 2024-11-20 06:40:57 info [:]: 🐂 Job done 79431bc8-736d-4379-bf0d-ddae76e0dabe
api-1 | 2024-11-20 06:40:57 warn [:]: You're bypassing authentication {}
playwright-service-1 | ✅ Scrape successful!
worker-1 | 2024-11-20 06:40:57 info [ScrapeURL:]: An unexpected error happened while scraping with playwright.
worker-1 | 2024-11-20 06:40:57 info [ScrapeURL:]: Scraping via fetch...
playwright-service-1 | ✅ Scrape successful!
worker-1 | 2024-11-20 06:40:57 info [ScrapeURL:]: An unexpected error happened while scraping with playwright.
worker-1 | 2024-11-20 06:40:57 info [ScrapeURL:]: Scraping via fetch...
playwright-service-1 | ✅ Scrape successful!
worker-1 | 2024-11-20 06:40:57 info [ScrapeURL:]: Scrape via fetch deemed successful.
worker-1 | 2024-11-20 06:40:58 info [ScrapeURL:]: An unexpected error happened while scraping with playwright.
worker-1 | 2024-11-20 06:40:58 info [ScrapeURL:]: Scraping via fetch...
worker-1 | 2024-11-20 06:40:58 info [ScrapeURL:]: Scrape via fetch deemed successful.
worker-1 | 2024-11-20 06:40:58 info [ScrapeURL:]: Scrape via fetch deemed successful.
worker-1 | 2024-11-20 06:40:58 info [:]: 🐂 Job done 27e3c51f-1b4a-45a9-8b4f-abe61f67ac8a
worker-1 | 2024-11-20 06:40:58 info [:]: 🐂 Job done e2ec67bb-740b-40fa-803d-4918ced6006c
worker-1 | 2024-11-20 06:40:58 info [:]: 🐂 Job done a252a1c1-aa4c-4f0c-960b-c379292cb997
worker-1 | 2024-11-20 06:40:59 info [:]: 🐂 Worker taking job 0ad857e1-70f6-4c2d-9255-2c890f207c5a
worker-1 | 2024-11-20 06:40:59 error [:]: 🐂 Job errored 0ad857e1-70f6-4c2d-9255-2c890f207c5a - TypeError: Cannot read properties of undefined (reading 'timeout') {}
worker-1 | 2024-11-20 06:40:59 error [:]: undefined {}
worker-1 | 2024-11-20 06:40:59 error [:]: TypeError: Cannot read properties of undefined (reading 'timeout')
worker-1 | at processJob (/app/dist/src/services/queue-worker.js:249:40)
worker-1 | at processJobInternal (/app/dist/src/services/queue-worker.js:65:30)
worker-1 | at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {}
worker-1 | /app/dist/src/main/runWebScraper.js:18
worker-1 | formats: job.data.scrapeOptions.formats.concat(["rawHtml"]),
worker-1 | ^
worker-1 |
worker-1 | TypeError: Cannot read properties of undefined (reading 'formats')
worker-1 | at startWebScraperPipeline (/app/dist/src/main/runWebScraper.js:18:49)
worker-1 | at processJob (/app/dist/src/services/queue-worker.js:245:57)
worker-1 | at processJobInternal (/app/dist/src/services/queue-worker.js:65:30)
worker-1 | at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
worker-1 |
worker-1 | Node.js v20.18.0
worker-1 exited with code 1

@lauridskern

same issue for me

@fatwang2

same issue

@shingoxray

same issue +1

@zhucan

zhucan commented Dec 3, 2024

same issue +1

@Hanxiao-Adam-Qi

Same issue: "info [ScrapeURL:]: An unexpected error happened while scraping with playwright." It occurs with both the original playwright service and playwright-ts.

@riddlegit

riddlegit commented Dec 9, 2024

Cannot read properties of undefined (reading 'timeout')

I guess this kind of error is caused by missing job config properties. Maybe try adding a "timeout" property to the JSON job data or to scrapeOptions.
https://docs.firecrawl.dev/v1-welcome
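
Both stack traces in this thread read a property of an undefined object (`timeout`, and `job.data.scrapeOptions.formats`), so the job data apparently arrived without `scrapeOptions`. A rough sketch of guarding against that, under stated assumptions: the `JobData` shape, the `normalizeJobData` helper, and the default timeout are hypothetical and not Firecrawl's actual code; only the `formats.concat(["rawHtml"])` step comes from the trace.

```typescript
// Hypothetical types; the real job schema in Firecrawl may differ.
interface ScrapeOptions {
  formats?: string[];
  timeout?: number;
}

interface JobData {
  url?: string;
  scrapeOptions?: ScrapeOptions;
}

// Assumed fallback value, chosen only for illustration.
const DEFAULT_TIMEOUT_MS = 30000;

// Fills in safe defaults instead of dereferencing an undefined scrapeOptions.
function normalizeJobData(data: JobData): { formats: string[]; timeout: number } {
  const formats = (data.scrapeOptions?.formats ?? []).concat(["rawHtml"]);
  const timeout = data.scrapeOptions?.timeout ?? DEFAULT_TIMEOUT_MS;
  return { formats, timeout };
}

// A job with no scrapeOptions no longer throws a TypeError.
console.log(normalizeJobData({ url: "https://example.com" }));
```

This only shows the guard; it does not explain why a job without `scrapeOptions` was enqueued in the first place.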


9 participants