Agent Reliability Expert
Company: VinSmart Future (VSF)
On-site
Gia Lam, Ha Noi, Vietnam

Company Description

VinSmart Future (VSF), a core technology company of Vingroup, is driven by a mission to shape Vietnam's digital future and enhance lives through innovative solutions. Formed through the integration of Vingroup's technology ecosystem, including VinApp, VinIT, VinBigData, and others, VSF develops unified technology platforms for Vingroup and its partners. The company focuses on providing safe, convenient, and seamless digital experiences. By joining VSF, you'll collaborate with leading technology experts from Vietnam and beyond to create impactful advancements that simplify life.

Mission. Make a non-deterministic system feel deterministic to the enterprise buyer. Own the evidence that an agent did what it was supposed to do, and the regression machinery that prevents silent decay.

What they own. Trace-level observability across every agent run; eval design (offline goldens, online shadow runs, model-graded eval with calibration, human review where it matters); per-skill SLOs and the dashboards leaders actually look at; regression gates that prevent a skill from being promoted to production data or actions until it has earned that promotion; a failure-mode taxonomy specific to enterprise agent work; the cost/latency/quality tradeoff story; and the measurement layer that proves the agent moved a real business outcome.

Sourcing filters.

6+ years of total experience, with at least 2 years in some combination of: ML evaluation/quality, production ML observability, or SRE/reliability for ML-powered products. At least 1 of those years focused specifically on LLM systems.
Strong data engineering instincts. Comfortable in SQL and Python; has built non-trivial data pipelines for evaluation datasets, trace analysis, or production telemetry.
Has worked with OpenTelemetry-style tracing or equivalent. Has integrated at least one LLM observability / eval platform in production (any of the common ones; we are filtering on depth of use, not on the specific vendor).
Has owned a real on-call or quality-incident response process for an LLM product.
Statistical literacy adequate for designing and interpreting evals: sampling, confidence intervals, A/B testing, inter-annotator agreement. Doesn't need to be a research scientist.

Strong signals.

Has built, not just bought, an eval system for a multi-step agent in production. Can describe their dataset construction strategy, how they calibrated their judges, the specific time a regression was caught before customers saw it, the time one wasn't, and what changed afterward.
Knows the limits of model-graded eval cold (bias, position effects, self-preference, brittleness to rubric phrasing) and has opinions on when to use it, when to use humans, and when to use both.
Has worked on trajectory evaluation, not just final-output evaluation. Has thought about what "correct" means when there are many valid paths to the same outcome.
Background blends ML evaluation thinking with production reliability instincts. Can write the runbook for "agent quality dropped 4% overnight, what now" and has done it.
Ships eval infrastructure incrementally, alongside the product, not as a six-month platform project. The first useful version was in place within weeks of needing it.
Pragmatic about tools — picks them for the problem, swaps them when they stop fitting.

Anti-signals.

Classical ML eval only (accuracy, F1, ROC) with no agent-system experience.
"Vibe checks" or hand-curated demo sets as the primary evaluation method.
Has never instrumented a real production trace end-to-end.
Believes a single benchmark number tells the story.
Treats observability as logging.
Wants to build the perfect eval platform before measuring anything.