Noodle
Network · Student Journey Co-Pilot · 40 program agents · Langfuse × Golden Evals
Agent Engagement — 30 days
Live data · Daily engagement share across 40 program agents. Higher = larger slice of network turn volume on that touchpoint.
Program Agent Leaderboard
Engagement share across all 4 touchpoints · last 24h
- 1. Online MBA - Flagship (You) · 7.8% · 0.1
- 2. Online MBA - East Coast · 3.3% · 0.1
- 3. Master of Education · 3.3% · 0.1
- 4. Online MBA - Southern · 3.1% · 0.0
- 5. Online MBA - West Coast · 3.1% · 0.2
- 6. Executive MBA · 2.9% · 0.1
- 7. Data Science Bootcamp - Cohort 1 · 2.8% · 0.5
- 8. MPH - Epidemiology · 2.8% · 0.2
- 9. MS in Data Science · 2.7% · 0.4
- 10. Master of Public Health · 2.7% · 0.7
- 11. RN to BSN · 2.7% · 0.1
- 12. MEd - Educational Leadership · 2.6% · 0.2
- 13. Online JD · 2.6% · 0.2
- 14. MEd - Curriculum & Instruction · 2.6% · 0.1
- 15. MS in Management · 2.5% · 0.2
- 16. MSW - Clinical Practice · 2.5% · 0.2
- 17. MS in Analytics · 2.5% · 0.3
- 18. MS in Finance · 2.4% · 0.2
- 19. MS in Business Analytics · 2.4% · 0.1
- 20. MPH - Global Health · 2.4% · 0.1
- 21. MS in Instructional Design · 2.3% · 0.2
- 22. Master of Social Work · 2.3% · 0.8
- 23. MS in Strategic Marketing · 2.2% · 0.2
- 24. MS in Software Engineering · 2.2% · 0.1
- 25. MSN - Nurse Leadership · 2.2% · 0.1
- 26. MSN - Family Nurse Practitioner · 2.1% · 0.2
- 27. MS in Higher Education Admin · 2.0% · 0.0
- 28. MS in Electrical Engineering · 2.0% · 0.1
- 29. Online LL.M. · 2.0% · 0.0
- 30. MS in Information Systems · 2.0% · 0.1
- 31. MSW - Macro Practice · 2.0% · 0.0
- 32. MS in Health Services Administration · 2.0% · 0.1
- 33. MEd - Special Education · 1.9% · 0.1
- 34. MS in HR Analytics · 1.9% · 0.0
- 35. MS in Cybersecurity · 1.9% · 0.0
- 36. MS in Healthcare Analytics · 1.9% · 0.2
- 37. MS in ML Engineering · 1.8% · 0.2
- 38. MSN - Nursing Informatics · 1.8% · 0.0
- 39. MS in Civil Engineering · 1.8% · 0.3
- 40. MSW - Advanced Standing · 1.7% · 0.0
Active Alerts
Auto-detected from Langfuse traces, golden-set evals, red-team sessions, and SLA monitors
MPH Flagship - hallucination spike on Learning agent (rolled back)
Learning · After the May 24 model rotation, the MPH Learning agent's hallucination rate jumped from 1.2% to 4.8% on case-study coursework prompts (Langfuse golden-set eval, n=120). Pattern: invented citations to non-existent CDC reports. Pinned the agent back to the prior prompt+model version. Root cause review queued; suspect a tool-call format regression on the new model.
MSW Flagship Support - three-way chat queue depth growing
Support · MSW Support routes to a campus advisor for high-sensitivity cases (mental health, accommodation, Title IX). Advisor SLA on the three-way chat slipped from 12 min to 47 min over the last 10 days. Best guess: staff turnover on the campus side. The agent is handing off correctly, but the human is no longer there in time. Escalated to the program lead.
3 programs have prompt-version regressions awaiting review
Pre-deploy eval suite flagged regressions on MS in Cybersecurity (tone shift in adversarial probes), MEd Special Education (fallback handling dropped on out-of-scope queries), and MS in Finance (longer responses than the rubric allows). Blocking deploy until reviewed.
Online MBA Flagship crossed 50K trace milestone in Langfuse
Enrollment · Flagship MBA agent surpassed 50,000 traces in Langfuse this week with a sustained 96.2% acceptance rate on the golden eval set. The longest-running deployment is now the most stable - prompt maturity compounds. Time for a case-study writeup for the partner success deck.
Red team flagged fallback weaknesses on 2 programs
Internal red-team session surfaced reproducible fallback failures on MSN - Nurse Leadership (agent answered when it should have escalated to faculty) and Online JD (provided generalized legal commentary outside the program's scoped knowledge base). Fixes in progress; new fallback rubric being added to the eval suite.
Data Science Bootcamp Cohort 1 - first-cohort volatility expected
The DS Bootcamp agent (deployed 38 days ago) is showing day-over-day swings of plus or minus 18 percent across all touchpoints. Pattern matches the first-90-day shakedown across every new program. Suppressing volatility alerts on this agent until day 90. Quality metrics are within range.
Recent Inquiries
Student inbound · program routing highlighted · 40 inquiries tracked total
Routed to the MS in Civil Engineering Learning agent. Pulled the module 4 syllabus plus the relevant case-study appendix from the knowledge base, walked through the DCF setup, and queued a TA office-hour link if the learner needs more.
Online MBA - West Coast Support agent looked up the relevant policy and offered to open a ticket with the registrar. No financial-aid triggers fired (this drop wouldn't affect their aid window). Escalation path queued in case of follow-up.
MS in Management enrollment agent surfaced the holistic-review policy, recommended bringing GRE or recent quantitative coursework as a strengthening signal, and offered to set up a 15-minute fit call with an enrollment counselor. Agent flagged for human review (admissions-policy sensitive).
Faculty Assist (three-way chat) returned the paper title plus DOI plus a one-paragraph summary, then asked the professor to confirm the citation before sending to the learner. Professor approved.
All Program Agents
40 agents · engagement across 4 touchpoints · 30-day trends
Inquiry Templates
10 canonical inquiry types · sampled across 4 touchpoints · 40 captured agent replies
What I'd Want to Dig Into in 30 / 60 / 90
7 pillars · 70 starting hypotheses · anchored against the public JD.
📝 Prompt Portfolio Build-Out
Phase 1 · Days 1-45 · Every program agent has a versioned, tested prompt for each touchpoint (enrollment, learning, support, faculty assist). Personas, scope constraints, fallback handling, and tone are explicit and documented per program.
Write, iterate, and maintain system prompts and instruction sets for Noodle's AI agents across the student journey.
- 1. Audit the existing prompt library across all program agents and tag each one by maturity (production, beta, experimental, deprecated).
- 2. Define a standard prompt structure - persona, scope, tone, fallback, escalation paths - and migrate the top 10 prompts to the standard first.
- 3. Build a per-program persona guide co-authored with program leads (voice, formality, jargon density, when to escalate).
- 4. Map the canonical inquiry types per touchpoint - the 10-20 questions that cover most volume - and ensure every program has a tested answer.
- 5. Multi-turn conversation logic: explicit branching for ambiguity, clarification, and graceful exit conditions.
- 6. RAG-augmented prompt patterns that draw from program-specific knowledge bases (syllabi, policies, alumni stories).
- 7. Few-shot example library curated per program category so new programs get a head-start, not a blank slate.
- 8. Output-format constraints (length, tone, citation requirements) encoded in prompts and verified by evals.
- 9. Sensitive-topic guardrails (mental health, accommodation, Title IX, financial aid) with explicit human-routing logic.
- 10. Quarterly prompt-portfolio review: which prompts are still in production, which got retired, which need a rebuild.
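The standard prompt structure from item 2 (persona, scope, tone, fallback, escalation paths) could be captured as a small typed record that refuses to render when a section is empty. This is only a sketch under assumptions: `PromptSpec`, its field names, and the `## Section` layout are hypothetical and not taken from Noodle's platform.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PromptSpec:
    """Hypothetical container for one program agent's prompt on one touchpoint."""

    program: str
    touchpoint: str  # e.g. "enrollment", "learning", "support", "faculty_assist"
    version: str
    persona: str
    scope: str
    tone: str
    fallback: str
    escalation_paths: tuple[str, ...] = ()

    def render(self) -> str:
        """Assemble the system prompt, failing loudly if any section is empty."""
        sections = [
            ("Persona", self.persona),
            ("Scope", self.scope),
            ("Tone", self.tone),
            ("Fallback", self.fallback),
            ("Escalation", "; ".join(self.escalation_paths)),
        ]
        missing = [name for name, body in sections if not body.strip()]
        if missing:
            raise ValueError(f"prompt sections missing: {', '.join(missing)}")
        return "\n\n".join(f"## {name}\n{body.strip()}" for name, body in sections)


spec = PromptSpec(
    program="Online MBA - Flagship",
    touchpoint="enrollment",
    version="1.0.0",
    persona="You are the enrollment guide for the Online MBA program.",
    scope="Answer only questions about admissions and enrollment.",
    tone="Warm, concise, professional.",
    fallback="If unsure, say so and offer a call with an enrollment counselor.",
    escalation_paths=("admissions-policy questions go to human review",),
)
system_prompt = spec.render()
```

Keeping the five sections as separate fields, rather than one free-form string, is what would let a later audit or lint step check each section on its own.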
🎯 Eval Framework + Golden Sets
Phase 1 · Days 1-60 · Every prompt change runs against a versioned golden set before deploy. Accuracy, tone, hallucination rate, task completion, and rubric alignment are measured and surfaced as a blocking quality gate.
Build and maintain evaluation frameworks to measure agent accuracy, tone, hallucination rate, task completion, and alignment with rubric-based learning objectives.
- 1. Define the eval rubric per touchpoint - what does 'good' mean for an Enrollment agent vs. a Learning agent vs. Faculty Assist.
- 2. Build a golden test set per program (20-100 inputs each) with reference outputs and tolerance bands, co-authored with learning designers.
- 3. Implement a model-graded eval for tone + persona adherence, with a human-graded random-sample audit weekly.
- 4. Hallucination detection: structured citation requirements + automated source-validation in the eval pipeline.
- 5. Pre-deploy gate: any prompt change blocks if it regresses on >5% of the golden set; auto-PR review with diff summary.
- 6. Per-program rubric calibration with the learning team - the same answer might be A+ for an MBA agent and a C for a Public Health one.
- 7. Continuous eval: nightly runs against a sampled production trace set, regressions paged to Slack within an hour.
- 8. Eval portability: rubrics expressed as code so they can be cloned, versioned, and audited like any other artifact.
- 9. Adversarial eval set maintained alongside the happy-path set: jailbreak attempts, ambiguous queries, hostile users.
- 10. Quarterly eval-framework review: what's catching regressions, what's missing, what's too noisy to act on.
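The pre-deploy gate in item 5 comes down to a small comparison between two golden-set runs. The sketch below assumes each run is a mapping from case ID to pass/fail, and that "regression" means a case that passed on the current prompt and fails on the candidate; `regression_gate` and `GateResult` are hypothetical names, and a case absent from the candidate run is treated as a failure.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    blocked: bool
    regression_rate: float
    regressed_cases: tuple[str, ...]


def regression_gate(
    baseline: dict[str, bool],
    candidate: dict[str, bool],
    max_regression_rate: float = 0.05,
) -> GateResult:
    """Block the deploy when more than 5% of golden-set cases regress."""
    # A case regresses if it passed before and does not pass now.
    regressed = tuple(
        sorted(
            case_id
            for case_id, passed in baseline.items()
            if passed and not candidate.get(case_id, False)
        )
    )
    rate = len(regressed) / len(baseline) if baseline else 0.0
    return GateResult(
        blocked=rate > max_regression_rate,
        regression_rate=rate,
        regressed_cases=regressed,
    )


# Usage: a 20-case golden set where the candidate prompt breaks two cases.
baseline_run = {f"case-{i:02d}": True for i in range(20)}
candidate_run = dict(baseline_run, **{"case-03": False, "case-11": False})
result = regression_gate(baseline_run, candidate_run)
```

The `regressed_cases` tuple is the natural input for the diff summary mentioned in the same item.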
📡 Langfuse Observability + Alerting
Phase 1 · Days 30-90 · Every agent turn is traced. Quality regressions, hallucination spikes, latency drift, and unusual user patterns surface as actionable alerts before users complain.
Use Langfuse to monitor prompt performance in production, identify regressions, and prioritize prompt improvements.
- 1. Standardize Langfuse trace metadata across all program agents (program_id, touchpoint, prompt_version, model, tool calls, latency).
- 2. Per-program dashboards: acceptance rate, escalation rate, top failure modes, p50/p95 latency, cost per resolved turn.
- 3. Anomaly detection: hallucination-rate spikes, fallback-rate spikes, sentiment dips, queue depth growth - all paged with reasonable thresholds.
- 4. Sample-based human review queue for high-stakes touchpoints (accessibility requests, mental health flags, admissions policy).
- 5. Trace-to-eval linkage: every trace tied to the prompt version + golden-set score that was in production at that moment.
- 6. Cost tracking per program per touchpoint, with budget alerts before overruns become headline-grabby.
- 7. Latency budget per touchpoint encoded as an SLO with error-budget burn alerts.
- 8. Weekly observability digest auto-drafted from the dashboard for partner success teams.
- 9. On-call rotation for prompt-systems incidents with a documented runbook per common failure mode.
- 10. Quarterly observability retro: what alerts fire too often, what's silent that shouldn't be, what's our mean-time-to-detect.
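One way to enforce the standardized metadata from item 1 is to validate the dictionary before it is attached to a trace. The sketch below is plain Python and deliberately makes no Langfuse SDK calls; the key names `tool_calls` and `latency_ms`, the touchpoint slugs, and the function `validate_trace_metadata` are assumptions layered on the field list above.

```python
from __future__ import annotations

# Required keys and their expected types (names are illustrative assumptions).
REQUIRED_FIELDS: dict[str, type | tuple[type, ...]] = {
    "program_id": str,
    "touchpoint": str,
    "prompt_version": str,
    "model": str,
    "tool_calls": list,
    "latency_ms": (int, float),
}
TOUCHPOINTS = {"enrollment", "learning", "support", "faculty_assist"}


def validate_trace_metadata(metadata: dict) -> list[str]:
    """Return a list of problems; an empty list means the metadata is usable."""
    problems: list[str] = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in metadata:
            problems.append(f"missing field: {name}")
        elif not isinstance(metadata[name], expected):
            problems.append(f"wrong type for field: {name}")
    touchpoint = metadata.get("touchpoint")
    if isinstance(touchpoint, str) and touchpoint not in TOUCHPOINTS:
        problems.append(f"unknown touchpoint: {touchpoint}")
    return problems


good = {
    "program_id": "online-mba-flagship",
    "touchpoint": "enrollment",
    "prompt_version": "1.4.2",
    "model": "example-model",
    "tool_calls": [],
    "latency_ms": 412.0,
}
bad = {"program_id": "online-jd", "touchpoint": "billing", "latency_ms": "fast"}
```

Returning a list of problems instead of raising keeps the check usable in both a blocking CI step and a non-blocking production warning.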
🛡️ Red-Team + Adversarial Testing
Phase 2 · Days 60-150 · Failure modes are surfaced before students find them. Jailbreaks, scope drift, sensitive-topic mishandling, and out-of-scope confident answers are caught in red-team sessions and fixed at the prompt layer.
Design red-teaming and adversarial testing protocols to surface edge cases and failure modes before agents reach students.
- 1. Define the red-team rubric: what categories of failure we test for (jailbreak, scope drift, sensitive-topic, hallucinated authority, persona drift, etc.).
- 2. Quarterly red-team session per program agent with a rotating internal team (engineers + learning designers + partner stakeholders).
- 3. Automated adversarial prompt-generation seeded by recent production failures and external LLM-jailbreak research.
- 4. Sensitive-topic test suite: every program's Support agent must correctly route mental health, accommodation, and Title IX inquiries.
- 5. External red-team partnership annually with a vendor that doesn't have our priors - fresh eyes find what we've stopped seeing.
- 6. Public bounty channel for university partners to surface failure modes their staff have caught in the wild.
- 7. Red-team findings versioned as eval cases - every reproducible failure becomes a permanent test in the golden set.
- 8. Pre-deploy red-team gate on high-stakes prompt changes: a small panel signs off on persona + scope changes before production.
- 9. Post-incident writeups blamelessly documenting any production failure that bypassed the eval suite, indexed and searchable.
- 10. Annual red-team report: what we learned, what got fixed, what failure classes are still unsolved.
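Item 7, turning reproducible findings into permanent golden-set tests, might look roughly like the following. Everything here is an illustrative assumption: the `RedTeamFinding` and `EvalCase` shapes, the `rt-` ID prefix, and the choice to derive a stable ID from a hash so the same finding reported twice yields one case.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RedTeamFinding:
    program: str
    category: str  # e.g. "jailbreak", "scope drift", "persona drift"
    attack_prompt: str
    expected_behavior: str
    reproducible: bool


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    program: str
    category: str
    input_text: str
    expected_behavior: str


def finding_to_eval_case(finding: RedTeamFinding) -> EvalCase:
    """Derive a stable case ID so duplicate findings map to the same case."""
    key = f"{finding.program}|{finding.category}|{finding.attack_prompt}"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]
    return EvalCase(
        case_id=f"rt-{digest}",
        program=finding.program,
        category=finding.category,
        input_text=finding.attack_prompt,
        expected_behavior=finding.expected_behavior,
    )


def merge_findings(
    golden_set: dict[str, EvalCase], findings: list[RedTeamFinding]
) -> dict[str, EvalCase]:
    """Return a new golden set that also contains every reproducible finding."""
    merged = dict(golden_set)
    for finding in findings:
        if not finding.reproducible:
            continue  # one-off failures stay in the session notes only
        case = finding_to_eval_case(finding)
        merged.setdefault(case.case_id, case)
    return merged


jd_finding = RedTeamFinding(
    program="Online JD",
    category="scope drift",
    attack_prompt="Can I sue my landlord over this?",
    expected_behavior="Decline legal commentary and point to program resources.",
    reproducible=True,
)
flaky = RedTeamFinding("Online JD", "jailbreak", "ignore the rules", "Refuse.", False)
golden = merge_findings({}, [jd_finding, jd_finding, flaky])
```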
👥 Three-Way Chat + Multi-Agent Workflows
Phase 2 · Days 90-180 · The Faculty Assist surface - learner + campus/Noodle staff + that staff member's AI assistant in the same thread - works smoothly and scales beyond pilot programs. Multi-step chained agent workflows handle the cases a single prompt can't.
Create the learner experiences defined by 'three-way' chat between the learner, a campus or Noodle support staff member, and that staff member's AI assistant. Design prompt architectures for multi-step and chained agent workflows.
- 1. Define the three-way chat protocol: who speaks when, what context the AI sees, what context it doesn't, escalation triggers.
- 2. AI-assist persona for staff: drafts replies, surfaces relevant policies, flags sensitive topics - but the staff member ships the message.
- 3. Latency budget for the assist surface: staff are watching live, so the AI has milliseconds before it becomes noticeable.
- 4. Audit trail: every staff-AI co-authored message is logged with the AI's draft + the staff edit diff for quality review.
- 5. Chained workflows: complex inquiries that require multi-step retrieval (policy lookup + transcript review + financial-aid check) handled by orchestrated sub-agents.
- 6. Hand-off design: when a single agent should escalate to the three-way chat vs. directly to a human-only ticket.
- 7. Multi-turn memory boundaries: what state persists, what resets, what gets pinned by the staff member.
- 8. Cross-program transfer: a learner asking about transferring credits across two programs needs an agent that holds both contexts.
- 9. Failure-mode planning: what happens when the staff member is offline mid-thread, when the AI is uncertain, when the learner disengages.
- 10. Quarterly three-way chat review with the campus staff who use it daily - usability + trust score + what they'd kill.
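The audit trail in item 4 (AI draft plus the staff edit diff) can be prototyped with the standard library's `difflib`. This is a sketch only: the `CoAuthoredMessage` record and `record_message` helper are hypothetical, and the similarity score is an extra assumption, added as a cheap signal of how heavily staff rewrite the drafts.

```python
from __future__ import annotations

import difflib
from dataclasses import dataclass


@dataclass(frozen=True)
class CoAuthoredMessage:
    thread_id: str
    ai_draft: str
    staff_final: str
    diff: tuple[str, ...]  # unified diff lines, empty when nothing changed
    similarity: float  # 1.0 means the staff member sent the draft unchanged


def record_message(thread_id: str, ai_draft: str, staff_final: str) -> CoAuthoredMessage:
    """Build the audit record for one staff-approved message."""
    diff = tuple(
        difflib.unified_diff(
            ai_draft.splitlines(),
            staff_final.splitlines(),
            fromfile="ai_draft",
            tofile="staff_final",
            lineterm="",
        )
    )
    similarity = difflib.SequenceMatcher(None, ai_draft, staff_final).ratio()
    return CoAuthoredMessage(thread_id, ai_draft, staff_final, diff, similarity)


draft = "Thanks for reaching out.\nPlease contact the registrar."
final = "Thanks for reaching out.\nPlease contact the registrar by Friday."
edited = record_message("thread-001", draft, final)
unchanged = record_message("thread-002", draft, draft)
```

Storing the diff rather than recomputing it later keeps the quality-review view stable even if the diffing logic changes.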
🧱 Reusable Prompt Components + Internal Standards
Phase 2 · Days 90-180 · Prompt patterns, scoring rubrics, fallback templates, and persona scaffolds live in a shared library. New programs onboard in days, not weeks. Internal teams configure their own agents against the same standards.
Establish prompt versioning practices and maintain a library of tested, reusable prompt components. Contribute prompt engineering guidelines and best-practices documentation for internal teams who configure their own agents on Noodle's AI orchestration platform.
- 1. Build the prompt-component library: persona blocks, fallback handlers, escalation routers, citation formatters, tone guards.
- 2. Versioning convention: semver for prompts (breaking persona shifts vs. tone refinements vs. typo fixes) and changelogs co-located with the component.
- 3. Internal documentation site: prompt patterns, anti-patterns, the eval rubric, the red-team rubric, runbooks.
- 4. Self-serve agent configuration playbook: 'I'm a Noodle PM, how do I spin up a new program agent against our standards?'
- 5. Lint-style prompt checker: surfaces missing fallback paths, missing scope guards, missing escalation routes before deploy.
- 6. Per-component eval suite: every reusable block has its own micro golden set so library upgrades are safe.
- 7. Internal office hours for teams configuring their own agents on Noodle's platform - lower the lift for non-engineers.
- 8. Library deprecation lifecycle: components don't live forever; sunset path is documented.
- 9. Reusable evaluator library too: tone classifiers, hallucination detectors, sentiment graders shared across programs.
- 10. Quarterly library review: what got used, what got cloned (signal it should be standardized), what got ignored.
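The versioning convention in item 2 maps change types onto semver parts: a breaking persona shift bumps major, a tone refinement bumps minor, a typo fix bumps patch. A minimal sketch follows, assuming plain `MAJOR.MINOR.PATCH` strings; the change-type labels and `bump_prompt_version` are hypothetical names.

```python
from __future__ import annotations

# Which semver part each kind of prompt change bumps (labels are assumptions).
BUMP_INDEX = {
    "persona_shift": 0,  # breaking: major
    "tone_refinement": 1,  # backwards-compatible behavior change: minor
    "typo_fix": 2,  # no behavior change intended: patch
}


def bump_prompt_version(version: str, change: str) -> str:
    """Return the next prompt version for the given kind of change."""
    parts = version.split(".")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {version!r}")
    if change not in BUMP_INDEX:
        raise ValueError(f"unknown change type: {change!r}")
    numbers = [int(part) for part in parts]
    index = BUMP_INDEX[change]
    numbers[index] += 1
    # Everything to the right of the bumped part resets to zero.
    for lower in range(index + 1, 3):
        numbers[lower] = 0
    return ".".join(str(number) for number in numbers)


next_major = bump_prompt_version("1.4.2", "persona_shift")  # "2.0.0"
next_minor = bump_prompt_version("1.4.2", "tone_refinement")  # "1.5.0"
next_patch = bump_prompt_version("1.4.2", "typo_fix")  # "1.4.3"
```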
🎙️ Voice + Multimodal Extension
Phase 3 · Days 180-365 · The same agent stack drives voice interfaces (phone enrollment, study-mode tutoring) and multimodal inputs (a learner pastes a screenshot of an assignment question). The prompt + eval + observability stack carries over.
Experience with voice agents or multimodal prompting (preferred).
- 1. Voice-first prompt patterns: shorter sentences, conversational repair, explicit confirmations on action-taking turns.
- 2. Phone-enrollment companion pilot with one program: prospective student calls in, voice agent handles intake + scheduling.
- 3. Multimodal inquiry handling: a learner pastes a screenshot of a homework question and the Learning agent uses vision + text together.
- 4. Transcript-quality monitoring: the new failure mode is ASR mistakes, not LLM mistakes - separate eval surface for that.
- 5. Voice-specific eval rubric: latency under 800ms perceived, no awkward silences, graceful fallback when the user goes off-script.
- 6. Per-program persona-as-voice: the persona is also a voice - tone, pace, warmth - chosen with the program lead.
- 7. Multimodal red-team: jailbreak via image, prompt-injection via screenshot text, all the new attack surfaces.
- 8. Latency engineering: streaming partial responses for voice, smart preloading of likely follow-ups.
- 9. Cross-modal hand-offs: voice agent escalates to a chat thread with context preserved.
- 10. Annual review of the voice + multimodal surfaces - is the same agent + eval stack still the right architecture, or do they fork.