Add CV Matcher wiki page

2026-05-22 19:05:41 +03:00
parent 3ae2f8e437
commit 7aa7377981
+174
@@ -0,0 +1,174 @@
# CV Matcher
The CV matcher is the core feature of myAi.ro. Users upload a CV PDF and either paste a single job URL/description or rely on the RAG index to find the best matches — and get a scored, structured analysis from an LLM with strengths, gaps, and recommendations.
## Service Chain
```
Browser / web
-> api (port 8080) -- captcha, rate limiting, email, CV file cache
-> cv-matcher-api (port 8082) -- match logic, RAG orchestration, LLM scoring
-> rag-api (port 8081) -- vector indexing and semantic search
-> OpenAI / Ollama -- LLM scoring (gpt-4o-mini by default)
```
`api` is the only internet-facing service. All calls to `cv-matcher-api` and `rag-api` require the `X-Internal-Api-Key` header.
## Flows
### 1 -- CV Upload
1. Browser `POST /api/cv-matcher/upload` (multipart PDF, GDPR consent, captcha token)
2. `api` verifies reCAPTCHA, forwards PDF to `cv-matcher-api POST /api/cv/upload`
3. `cv-matcher-api` calls `rag-api POST /api/rag/index` to chunk and embed the PDF
4. `rag-api` returns `{ documentId, textHash, chunks, characters, cached }`
5. `api` caches the PDF to `{FileStorage:Path}/{documentId}.pdf` for later email attachment
6. Returns `CvUploadResponse` to the browser
If the same PDF was previously uploaded (same `textHash`), `rag-api` returns the cached document — no re-embedding cost.
### 2 -- Match CV to a Single Job
1. Browser `POST /api/cv-matcher/match-job` with `{ cvDocumentId, jobUrl or jobDescription, email, gdprConsent, captchaToken }`
2. `api` verifies reCAPTCHA, forwards to `cv-matcher-api POST /api/cv/match-job`
3. `cv-matcher-api`:
- Fetches CV text from `rag-api GET /api/rag/document/{cvDocumentId}`
- Fetches and strips HTML from `jobUrl` via `JobTextExtractor` (or uses pasted `jobDescription`)
- Indexes the job text into `rag-api` (type = "job")
- Runs a semantic search against the RAG index to find matching job chunks
- Calls `ScorePairAsync` (LLM) to produce the structured match result
- Caches the result in `cvMatcher.CvMatchResults` by `(cvDocumentId, jobDocumentId)` hash
4. `api` (on return):
- If `email` was provided, creates a job search token via `IJobSearchApi.CreateTokenAsync`
- Sends match result email with CV PDF attached and job search link included
5. Returns `JobMatchResponse` to the browser
### 3 -- Find Jobs from RAG Index
1. Browser `POST /api/cv-matcher/find-jobs` with `{ cvDocumentId, topK }`
2. `cv-matcher-api` fetches CV text from `rag-api`
3. Builds a CV search profile string from the CV text
4. Calls `rag-api` semantic search against indexed jobs (`targetDocumentTypes: ["job"]`)
5. Takes top `DeepScoreTopN` results (default 5), runs `ScorePairAsync` LLM scoring on each
6. Returns `FindJobsResponse { jobs: JobMatchResponse[] }`
## LLM Scoring (`ScorePairAsync`)
Called for both match-job and find-jobs. Checks the DB cache first -- if a result exists for the same `(cvId, jobId)` pair it is returned immediately (no AI call).
If not cached:
- Truncates CV text to 18 000 chars, job text to 14 000 chars
- Takes up to 4 RAG evidence chunks (or first 4 000 chars of job text as fallback)
- Sends `system + user` prompt to the configured AI provider with `temperature = 0.2`
- Expects JSON response; falls back to a safe error object if parsing fails
- Persists the raw AI chat response in `cvMatcher.CvMatcherChatCache` by a hash of `(provider, model, temperature, systemPrompt, userPrompt)`
### Match Result Structure (`JobMatchResponse`)
| Field | Type | Description |
|-------|------|-------------|
| `score` | int 0-100 | Overall match percentage |
| `summary` | string | One-paragraph narrative |
| `strengths` | string[] | CV aspects that match well |
| `gaps` | string[] | Missing or weak areas |
| `recommendations` | string[] | Actionable advice for the candidate |
| `evidence` | string[] | RAG chunks that drove the score |
| `cached` | bool | True if returned from DB cache |
| `jobDocumentId` | string? | RAG document id of the indexed job |
| `jobUrl` | string? | Source URL of the job |
## JobTextExtractor
Extracts plain text from a job posting for the LLM prompt.
- If `jobDescription` (pasted text) is provided it is used directly -- no HTTP call
- Otherwise fetches `jobUrl`, strips `<script>`, `<style>`, and all HTML tags, decodes HTML entities, collapses whitespace
- Truncates to `MaxJobTextChars` (default 60 000, minimum 4 000)
- Throws `InvalidOperationException` if the extracted text is under 80 characters
User-agent sent: `MyAi.ro CV Matcher/1.0`. HTTP timeout: 25 seconds.
## AI Providers
Configured under `Ai:Provider` (`OpenAI` or `Ollama`).
| Setting | Default | Notes |
|---------|---------|-------|
| `Ai:Provider` | `OpenAI` | Switch to `Ollama` for local/offline |
| `Ai:OpenAI:ChatModel` | `gpt-4o-mini` | Any OpenAI chat model |
| `Ai:OpenAI:TimeoutSeconds` | `90` | Per-request timeout |
| `Ai:Ollama:BaseUrl` | `http://host.docker.internal:11434` | Local Ollama instance |
| `Ai:Ollama:ChatModel` | `llama3.1:8b` | Any Ollama chat model |
Both providers use `response_format: json_object` (or Ollama `format: "json"`) to guarantee parseable output. All AI responses are cached in the DB by content hash -- repeated identical prompts never hit the API twice.
## Caching
Two layers of caching in `cvMatcher` schema:
| Cache | Table | Key | What's stored |
|-------|-------|-----|---------------|
| AI responses | `CvMatcherChatCache` | SHA256 of full prompt + model | Raw JSON string from LLM |
| Match results | `CvMatchResults` | `(cvDocumentId, jobDocumentId)` | Full `JobMatchResponse` |
The match result cache means re-matching the same CV against the same job URL is instant and free.
## API Routes
### `api` (public, port 8080)
| Method | Route | Description |
|--------|-------|-------------|
| POST | `/api/cv-matcher/upload` | Upload CV PDF (multipart) |
| POST | `/api/cv-matcher/match-job` | Match CV to a job URL or pasted description |
| GET | `/api/cv-matcher/job-search/start?t=` | One-click job search start (token link) |
Rate limited by the `cvMatcher` policy: 10 requests / 10 minutes per IP.
### `cv-matcher-api` (internal, port 8082)
| Method | Route | Description |
|--------|-------|-------------|
| POST | `/api/cv/upload` | Index CV PDF into RAG |
| POST | `/api/cv/match-job` | Score CV against a job URL or text |
| POST | `/api/cv/find-jobs` | Find top jobs from RAG index for a CV |
| POST | `/api/cv/job-search/token` | Create job search token |
| POST | `/api/cv/job-search/token/{id}/start` | Validate token, create Pending session |
| GET | `/api/health` | Health check |
## Settings Reference
### `Matcher` section (`cv-matcher-api`)
| Key | Default | Description |
|-----|---------|-------------|
| `TopK` | `10` | RAG search result count |
| `DeepScoreTopN` | `5` | How many RAG results get LLM deep scoring |
| `MaxJobTextChars` | `60000` | Max job text length sent to LLM |
### `FileStorage` section (`api`)
| Key | Default | Description |
|-----|---------|-------------|
| `Path` | `Files` | Directory for cached CV PDFs (relative to app root or absolute) |
Shared via bind mount with `cv-cleanup-job` and `cv-search-job`.
## Match Email
Sent by `api` via SMTP after a successful match when `email` is provided.
- Subject: `MyAi.ro CV Match: {score}% -- {jobLabel}`
- Body: score, summary, strengths, gaps, recommendations
- Attachment: cached CV PDF from `{FileStorage:Path}/{documentId}.pdf`
- Footer: job search link (if token creation succeeded) — see [[Features/Internet-Job-Search]]
Sending is fire-and-forget: email failure does not affect the match result returned to the browser.
## Database Schema (`cvMatcher`)
Managed by `CvMatcherDbContext`. Migrations live in `Apis/cv-matcher-api/Migrations/`.
```powershell
dotnet ef migrations add <Name> --context CvMatcherDbContext --project Apis/cv-matcher-api
```