From 7aa7377981f69631147adeb34928b29e77e26721 Mon Sep 17 00:00:00 2001
From: gelu
Date: Fri, 22 May 2026 19:05:41 +0300
Subject: [PATCH] Add CV Matcher wiki page

---
 CV-Matcher.md | 174 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 174 insertions(+)
 create mode 100644 CV-Matcher.md

diff --git a/CV-Matcher.md b/CV-Matcher.md
new file mode 100644
index 0000000..03d3a80
--- /dev/null
+++ b/CV-Matcher.md
@@ -0,0 +1,174 @@
+# CV Matcher
+
+The CV matcher is the core feature of myAi.ro. Users upload a CV PDF, then either paste a single job URL or description or rely on the RAG index to find the best matches. In both cases they get a scored, structured analysis from an LLM with strengths, gaps, and recommendations.
+
+## Service Chain
+
+```
+Browser / web
+  -> api (port 8080)                  -- captcha, rate limiting, email, CV file cache
+       -> cv-matcher-api (port 8082)  -- match logic, RAG orchestration, LLM scoring
+            -> rag-api (port 8081)    -- vector indexing and semantic search
+            -> OpenAI / Ollama        -- LLM scoring (gpt-4o-mini by default)
+```
+
+`api` is the only internet-facing service. All calls to `cv-matcher-api` and `rag-api` require the `X-Internal-Api-Key` header.
+
+## Flows
+
+### 1 -- CV Upload
+
+1. Browser sends `POST /api/cv-matcher/upload` (multipart PDF, GDPR consent, captcha token)
+2. `api` verifies reCAPTCHA, forwards the PDF to `cv-matcher-api POST /api/cv/upload`
+3. `cv-matcher-api` calls `rag-api POST /api/rag/index` to chunk and embed the PDF
+4. `rag-api` returns `{ documentId, textHash, chunks, characters, cached }`
+5. `api` caches the PDF to `{FileStorage:Path}/{documentId}.pdf` for later email attachment
+6. `api` returns `CvUploadResponse` to the browser
+
+If the same PDF was previously uploaded (same `textHash`), `rag-api` returns the cached document, so there is no re-embedding cost.
+
+### 2 -- Match CV to a Single Job
+
+1. Browser sends `POST /api/cv-matcher/match-job` with `{ cvDocumentId, jobUrl or jobDescription, email, gdprConsent, captchaToken }`
+2. `api` verifies reCAPTCHA, forwards to `cv-matcher-api POST /api/cv/match-job`
+3. `cv-matcher-api`:
+   - Fetches CV text from `rag-api GET /api/rag/document/{cvDocumentId}`
+   - Fetches and strips HTML from `jobUrl` via `JobTextExtractor` (or uses pasted `jobDescription`)
+   - Indexes the job text into `rag-api` (type = "job")
+   - Runs a semantic search against the RAG index to find matching job chunks
+   - Calls `ScorePairAsync` (LLM) to produce the structured match result
+   - Caches the result in `cvMatcher.CvMatchResults`, keyed by a hash of `(cvDocumentId, jobDocumentId)`
+4. `api` (on return):
+   - If `email` was provided, creates a job search token via `IJobSearchApi.CreateTokenAsync`
+   - Sends the match result email with the CV PDF attached and the job search link included
+5. `api` returns `JobMatchResponse` to the browser
+
+### 3 -- Find Jobs from RAG Index
+
+1. Browser sends `POST /api/cv-matcher/find-jobs` with `{ cvDocumentId, topK }`
+2. `cv-matcher-api` fetches CV text from `rag-api`
+3. Builds a CV search profile string from the CV text
+4. Calls `rag-api` semantic search against indexed jobs (`targetDocumentTypes: ["job"]`)
+5. Takes the top `DeepScoreTopN` results (default 5) and runs `ScorePairAsync` LLM scoring on each
+6. Returns `FindJobsResponse { jobs: JobMatchResponse[] }`
+
+## LLM Scoring (`ScorePairAsync`)
+
+Called for both match-job and find-jobs. It checks the DB cache first: if a result exists for the same `(cvId, jobId)` pair, it is returned immediately (no AI call).
+
+If not cached:
+- Truncates CV text to 18 000 chars, job text to 14 000 chars
+- Takes up to 4 RAG evidence chunks (or first 4 000 chars of job text as fallback)
+- Sends `system + user` prompt to the configured AI provider with `temperature = 0.2`
+- Expects JSON response; falls back to a safe error object if parsing fails
+- Persists the raw AI chat response in `cvMatcher.CvMatcherChatCache` by a hash of `(provider, model, temperature, systemPrompt, userPrompt)`
+
+### Match Result Structure (`JobMatchResponse`)
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `score` | int 0-100 | Overall match percentage |
+| `summary` | string | One-paragraph narrative |
+| `strengths` | string[] | CV aspects that match well |
+| `gaps` | string[] | Missing or weak areas |
+| `recommendations` | string[] | Actionable advice for the candidate |
+| `evidence` | string[] | RAG chunks that drove the score |
+| `cached` | bool | True if returned from DB cache |
+| `jobDocumentId` | string? | RAG document id of the indexed job |
+| `jobUrl` | string? | Source URL of the job |
+
+## JobTextExtractor
+
+Extracts plain text from a job posting for the LLM prompt.
+
+- If `jobDescription` (pasted text) is provided it is used directly -- no HTTP call
+- Otherwise fetches `jobUrl`, strips `