feat(providers): add headless browser scraping via Playwright for SPA job sites
Build and Push Docker Images Staging / build (push) Successful in 5m20s
Build and Push Docker Images Staging / build (push) Successful in 5m20s
ejobs.ro migrated to a Nuxt SPA - plain HTTP GET returns only the JS bundle. This change equips cv-search-job with a headless Chromium (Playwright 1.60) so it can fully render SPA pages before extracting job links. - Add UseHeadlessBrowser flag to JobProviderEntity, JobProviderConfig, and CvSearchDbContext; map it in JobTokenService.ToConfig so the flag is included in the session provider-config snapshot - Migration: add UseHeadlessBrowser column; fix ejobs.ro search URL (remove /user/ prefix that caused 404) and set UseHeadlessBrowser=true - HtmlJobSearcher: detect flag and dispatch to FetchWithPlaywrightAsync; plain-HTTP path is unchanged; NetworkIdle timeout falls back to partial content rather than failing outright - Dockerfile: download Playwright Chromium in the SDK build stage via npx; copy browser binaries to the final image; install Chromium system libs (Ubuntu noble t64 variants) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -29,9 +29,28 @@ COPY Helpers/startup-helpers/ Helpers/startup-helpers/
|
||||
|
||||
RUN dotnet publish Jobs/cv-search-job/cv-search-job.csproj -c $BUILD_CONFIGURATION -o /app/publish /p:UseAppHost=false
|
||||
|
||||
# Download Playwright Chromium browser in the build stage.
|
||||
# Node.js is only needed here to run npx — it is not copied to the final image.
|
||||
ENV PLAYWRIGHT_BROWSERS_PATH=/ms-playwright
|
||||
RUN apt-get update && apt-get install -y --no-install-recommends nodejs npm \
|
||||
&& npx --yes playwright@1.60.0 install chromium \
|
||||
&& rm -rf /var/lib/apt/lists/*
|
||||
|
||||
FROM mcr.microsoft.com/dotnet/aspnet:10.0 AS final
|
||||
WORKDIR /app
|
||||
|
||||
# System libraries required by Chromium on Debian bookworm
|
||||
RUN apt-get update && apt-get install -y --no-install-recommends \
|
||||
libnss3 libnspr4 libatk1.0-0 libatk-bridge2.0-0 libcups2 libdrm2 \
|
||||
libxkbcommon0 libxcomposite1 libxdamage1 libxfixes3 libxrandr2 \
|
||||
libgbm1 libasound2t64 libpango-1.0-0 libcairo2 libatspi2.0-0 \
|
||||
libwayland-client0 libx11-xcb1 libx11-6 libxcb1 libxext6 \
|
||||
&& rm -rf /var/lib/apt/lists/*
|
||||
|
||||
# Copy the Playwright Chromium browser from the build stage
|
||||
ENV PLAYWRIGHT_BROWSERS_PATH=/ms-playwright
|
||||
COPY --from=build /ms-playwright /ms-playwright
|
||||
|
||||
COPY --from=build /app/publish .
|
||||
|
||||
ENTRYPOINT ["dotnet", "cv-search-job.dll"]
|
||||
|
||||
Reference in New Issue
Block a user