The PM's Guide to Working with ML Engineers


October 12, 2024 · 16 min read · Top 10 ROI Guide

The biggest bottleneck in AI product development is the communication gap between PMs and ML engineers. After three years managing ML teams across two companies, here are 10 high-ROI practices for PM-ML collaboration covering specs, performance reviews, and sprint planning.

Why is the PM-ML engineer relationship uniquely difficult?

PMs think in features, deadlines, and user outcomes. ML engineers think in distributions, metrics, and uncertainty. Neither is wrong, but the gap causes more delays than any technical challenge.

According to a 2024 Reforge survey, 71% of ML engineers reported misaligned PM expectations as the primary source of friction, ahead of data quality (54%) and infrastructure (41%). Teams with established PM-ML communication frameworks ship features 2.3x faster. At a YC-backed tax-tech startup and a $40M insurance-tech company, I watched the same communication failures recur. Here are the 10 practices that fixed them.

What are the 10 highest-ROI PM-ML collaboration practices?

Practice 1: Spec in outcomes, not accuracy targets

The most common PM mistake is speccing an ML feature as "achieve 95% accuracy." This sounds precise but is almost meaningless. Accuracy on what data? Measured how? At what confidence threshold? Against which baseline?

Instead, spec in business outcomes: "Reduce manual data entry time by 60% for W-2 documents, measured by comparing average completion time before and after the feature launches, across a cohort of 500 users over 4 weeks." This gives the ML engineer flexibility to optimize what matters. Maybe 88% extraction accuracy with smart UI fallbacks achieves the 60% time reduction. Maybe 95% accuracy is needed. The engineer can explore the solution space. According to a 2024 Google ML product management guide, outcome-based specs lead to 40% faster iteration cycles compared to metric-target specs because engineers spend less time negotiating the metric and more time solving the problem. [LINK:post-30]

Practice 2: Understand the difference between "we need more data" and "we need better data"

When an ML engineer says "we need more data," PMs often hear "go collect 100,000 more samples." Sometimes that is right. But more often, the real need is better data -- more representative samples of edge cases, cleaner labels, or data that covers distribution gaps.

At the tax-tech startup, our model struggled with K-1 forms. The ML engineer said "we need more K-1 data." We had 1,200 samples but 80% came from the same 3 institutions. Getting 200 K-1s from 40 different sources improved accuracy more than 2,000 from the same 3. According to Stanford research, targeted collection for underrepresented distributions improves accuracy 5-8x more per sample than random collection.

Practice 3: Budget for experimentation sprints, not just delivery sprints

ML development is not like feature development. You cannot guarantee that two weeks of work will produce a shippable result. Sometimes the approach does not work and you have to try something different. PMs who plan ML work in traditional delivery sprints create an environment where engineers optimize for showing progress rather than finding the best solution.

| Sprint Type | Goal | Success Metric | Frequency |
| --- | --- | --- | --- |
| Experimentation sprint | Test 2-3 approaches to a problem | Clear recommendation with data | 1 in every 4 sprints |
| Development sprint | Build the chosen approach | Working implementation | 2 in every 4 sprints |
| Hardening sprint | Production-readiness, monitoring, testing | Deployed with monitoring | 1 in every 4 sprints |

This 1:2:1 ratio came from tracking actual time allocation across 18 months. Teams that tried to run all development sprints ended up with 30% of sprints producing throwaway work anyway -- the experimentation happened, it just was not planned for, leading to missed commitments and trust erosion. According to a 2024 MLOps Community survey, teams with dedicated experimentation time ship 1.8x more features per quarter than teams without it, because failed experiments are identified and abandoned faster. [LINK:post-33]

Practice 4: Learn to read a confusion matrix

You do not need to understand gradient descent. You do need to understand a confusion matrix. It is the single most important artifact in PM-ML communication because it answers the business question: "What kinds of mistakes is the model making, and what do those mistakes cost?"

Confusion matrix in 30 seconds: A 2x2 grid showing four outcomes: true positives (model correct, thing is there), true negatives (model correct, thing is not there), false positives (model says yes incorrectly), false negatives (model says no incorrectly). The PM's job is to assign business cost to each quadrant and help the ML engineer optimize the right tradeoff.

At the tax-tech startup, our document classification model had 93% accuracy. Sounds great. But the confusion matrix showed that false negatives -- failing to classify a document that was actually a W-2 -- happened 12% of the time for documents with non-standard formatting. Those false negatives caused users to re-upload documents, wasting their time and creating support tickets. False positives -- classifying a non-W-2 as a W-2 -- only happened 2% of the time and were caught by the extraction stage. The business cost of false negatives was 6x higher than false positives. We retuned the model to reduce false negatives at the expense of slightly more false positives, and user complaints dropped 40%. [LINK:post-32]
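To make the cost tradeoff concrete, here is a minimal sketch of tallying a confusion matrix and weighting each error type by its business cost. The labels, predictions, and cost figures are illustrative, not the startup's actual data; only the FN-costs-6x-FP ratio mirrors the example above.

```python
from collections import Counter

# Hypothetical labels: is each uploaded document a W-2 ("w2") or not ("other")?
y_true = ["w2", "w2", "other", "w2", "other", "w2", "other", "w2"]
y_pred = ["w2", "other", "other", "w2", "w2", "w2", "other", "w2"]

# Tally the four quadrants of the confusion matrix.
counts = Counter()
for truth, pred in zip(y_true, y_pred):
    if truth == "w2" and pred == "w2":
        counts["tp"] += 1  # true positive: W-2 correctly classified
    elif truth == "other" and pred == "other":
        counts["tn"] += 1  # true negative
    elif truth == "other" and pred == "w2":
        counts["fp"] += 1  # false positive: caught later by the extraction stage
    else:
        counts["fn"] += 1  # false negative: user has to re-upload

# Illustrative per-error business costs (FN is 6x FP, as in the example above).
cost_per_fp, cost_per_fn = 1.0, 6.0
total_cost = counts["fp"] * cost_per_fp + counts["fn"] * cost_per_fn

accuracy = (counts["tp"] + counts["tn"]) / len(y_true)
print(dict(counts), f"accuracy={accuracy:.2f}", f"business_cost={total_cost}")
```

The point of the exercise: two models with identical accuracy can have very different `total_cost`, and only the cost-weighted view tells you which one to ship.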

Practice 5: Define "good enough" before development starts

ML engineers are optimizers by nature. Without a clear "good enough" threshold, they will continue improving a model long past the point of diminishing returns. The PM's job is to define the minimum acceptable performance that unblocks the product launch, distinct from the aspirational target.

According to a 2024 Amplitude study on AI feature launches, 43% of AI features are delayed by more than one month because teams continue optimizing past the minimum viable accuracy. The average accuracy difference between the "good enough" and "shipped" versions was only 1.7 percentage points -- often not perceptible to users.
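One way to make the "good enough" bar unambiguous is to write it down as an explicit launch gate, separate from the aspirational targets. This is a hypothetical sketch; the metric names and thresholds are invented for illustration, not a real team's numbers.

```python
# Hypothetical launch gate: minimum acceptable metrics agreed before development
# starts, kept separate from the aspirational targets the team optimizes toward.
MINIMUM_VIABLE = {"extraction_accuracy": 0.88, "p95_latency_ms": 800}
ASPIRATIONAL = {"extraction_accuracy": 0.95, "p95_latency_ms": 500}

def launch_decision(metrics: dict) -> str:
    """Return a ship/hold decision against the pre-agreed minimum bar."""
    if metrics["extraction_accuracy"] < MINIMUM_VIABLE["extraction_accuracy"]:
        return "hold: below minimum accuracy"
    if metrics["p95_latency_ms"] > MINIMUM_VIABLE["p95_latency_ms"]:
        return "hold: latency over budget"
    # Above the bar: ship now, keep optimizing toward the aspirational targets.
    return "ship"

print(launch_decision({"extraction_accuracy": 0.90, "p95_latency_ms": 650}))
```

Because the gate is written down before development, "should we keep optimizing?" becomes a product conversation about moving the bar, not a last-minute negotiation.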

Practice 6: Attend model review sessions, not just sprint demos

Sprint demos show what was built. Model review sessions show how the model is performing on real data, where it is failing, and what the error patterns look like. These are the meetings where PMs learn the most and where their input on business context is most valuable.

I attended weekly model reviews for 14 months. My contribution was saying things like "that error pattern correlates with our highest-churn users" or "that failure mode triggers a regulatory report." Business context changed prioritization in 8 of 14 months. According to Harvard Business Review, PM participation in model reviews improves feature-market fit by 34%.

Practice 7: Create a shared glossary on day one

The word "accuracy" means different things to PMs and ML engineers. To a PM, it means "does this feel right to the user." To an ML engineer, it means a specific mathematical metric that may have nothing to do with user perception. Common ambiguous terms include:

| Term | PM Interpretation | ML Engineer Interpretation | Shared Definition Needed |
| --- | --- | --- | --- |
| Accuracy | User perceives results as correct | (TP + TN) / Total predictions | Task-specific: extraction accuracy vs classification accuracy |
| Confidence | How sure the system is | Calibrated probability estimate | Confidence = probability of correct answer, calibrated against actuals |
| Edge case | Unusual user scenario | Input outside training distribution | Categorized list of known edge cases with expected behavior |
| Latency | Time user waits | Inference time only | End-to-end latency including pre/post processing |
| Improvement | Users notice it is better | Metric moved on eval set | User-facing metric (e.g., manual correction rate decreased) |
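The "confidence" row deserves a concrete illustration. A calibration check asks: of the predictions the model reported at ~90% confidence, what fraction were actually correct? This is a minimal sketch with made-up `(confidence, was_correct)` pairs, assuming such pairs are logged from production.

```python
# Minimal calibration check. If the model is calibrated, predictions reported
# at ~0.9 confidence should be correct ~90% of the time.
predictions = [
    (0.95, True), (0.92, True), (0.91, False), (0.90, True),
    (0.55, True), (0.52, False), (0.51, False), (0.50, False),
]

def bin_key(conf: float) -> float:
    """Bucket confidence into 0.1-wide bins by lower edge.

    Fine for this illustration; production code should bin more carefully
    around floating-point edges."""
    return int(conf * 10) / 10

bins: dict[float, list[bool]] = {}
for conf, correct in predictions:
    bins.setdefault(bin_key(conf), []).append(correct)

for lower, outcomes in sorted(bins.items()):
    observed = sum(outcomes) / len(outcomes)
    print(f"confidence {lower:.1f}-{lower + 0.1:.1f}: "
          f"observed accuracy {observed:.2f} over {len(outcomes)} samples")
```

A gap between reported confidence and observed accuracy in any bin is exactly the kind of finding that belongs in the shared glossary conversation: the PM's "how sure is it?" and the engineer's probability estimate stop meaning the same thing.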

Practice 8: Treat model versioning like product versioning

Models are not static. They get retrained, fine-tuned, and updated. Each version has different performance characteristics. PMs should track model versions the same way they track product releases -- with changelogs, performance comparisons, and rollback plans.

At the tax-tech startup, we maintained a model registry with performance benchmarks for each version. When model v2.3 showed a 2% accuracy improvement on our eval set but a 15% increase in latency, the PM decision was straightforward: the latency increase would hurt user experience more than the accuracy improvement would help. We shipped v2.2 to production and continued optimizing v2.3. Without version tracking, that tradeoff would have been invisible. [LINK:post-31]
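A registry does not need to be elaborate to make that tradeoff visible. This is a hypothetical sketch, not the startup's actual registry: each entry records benchmark results per version, and a simple rule ships the most accurate model that fits the latency budget. The numbers mirror the v2.2/v2.3 story above but are illustrative.

```python
from dataclasses import dataclass

@dataclass
class ModelVersion:
    """One registry entry: benchmark results captured when the version was built."""
    version: str
    eval_accuracy: float   # accuracy on the frozen eval set
    p95_latency_ms: float  # end-to-end latency, not inference-only

registry = [
    ModelVersion("v2.2", eval_accuracy=0.91, p95_latency_ms=420),
    ModelVersion("v2.3", eval_accuracy=0.93, p95_latency_ms=483),
]

def pick_production(candidates: list[ModelVersion],
                    max_latency_ms: float) -> ModelVersion:
    """Among versions within the latency budget, ship the most accurate one."""
    eligible = [m for m in candidates if m.p95_latency_ms <= max_latency_ms]
    return max(eligible, key=lambda m: m.eval_accuracy)

chosen = pick_production(registry, max_latency_ms=450)
print(chosen.version)  # v2.3 is more accurate but over budget, so v2.2 ships
```

The design point: encoding the latency budget as data makes the tradeoff a recorded product decision rather than an invisible default.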

Practice 9: Build feedback loops into the product, not as an afterthought

The most valuable data for ML improvement comes from users correcting the model's output. But correction mechanisms need to be designed into the product from launch. Retrofitting feedback loops is 4x more expensive than building them upfront because you have to redesign UX flows and backfill missing data.

We logged every user correction to AI-extracted values -- implicit feedback, not thumbs up/down. Over one season, 847,000 corrections drove a 4.2 point accuracy improvement. According to Google AI research, production feedback loops improve accuracy 2-4x faster per engineering hour than additional labeled training data. [LINK:post-24]
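The mechanics of implicit feedback are simple if designed in from launch: every time a user edits an AI-extracted value, the edit handler emits a labeled example. This sketch uses invented field names and a JSON line as a stand-in for whatever event pipeline a team actually runs.

```python
import json
import time

def log_correction(doc_id: str, field: str,
                   predicted: str, corrected: str) -> str:
    """Record an implicit-feedback event: the user edited an extracted value.

    Each event is a labeled training example (prediction vs. ground truth)
    that costs nothing extra to collect once the save handler calls this."""
    event = {
        "ts": time.time(),
        "doc_id": doc_id,
        "field": field,
        "predicted": predicted,
        "corrected": corrected,  # the user's final value is the true label
        "was_correct": predicted == corrected,
    }
    return json.dumps(event)  # in production: send to an event stream, not stdout

# Called from the UI's save handler whenever a user edits an extracted value.
print(log_correction("doc-123", "wages_box1", "54,300", "54,800"))
```

Note that events are emitted even when the user changes nothing (`was_correct: true`): without the non-corrections, the feedback dataset contains only failures and cannot estimate accuracy.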

Practice 10: Celebrate negative results

In ML, an experiment that proves an approach does not work is a success. PMs who punish negative results get engineers who hide them. We instituted "Negative Result Fridays." The best one: an engineer spent a week proving GPT-4 fine-tuning produced worse results than our custom model, saving us from a 3-month migration that leadership was pushing. That saved $180,000. According to Nature Machine Intelligence, publication bias against negative results costs the ML industry $2-4 billion annually.

How do you review model performance together?

The performance review session is the most important recurring meeting between PMs and ML engineers. Our 45-minute weekly format: metric overview (5 min), error deep-dive on 3 samples with PM providing business context (15 min), distribution shift analysis (10 min), joint prioritization (10 min), and open questions (5 min).

This replaced ad-hoc Slack conversations and late-night "why is accuracy dropping" panics. According to a 2024 Lenny's Newsletter survey, teams with structured model review sessions report 52% fewer surprise degradation incidents. [LINK:post-35]

Frequently Asked Questions

Do PMs need to learn to code to work effectively with ML engineers?

No. Learn to read metrics, not write code. Interpreting a confusion matrix and understanding precision-recall tradeoffs is sufficient. Tools like Cursor can help you read the ML codebase without modifying it.

How should AI PM candidates demonstrate ML collaboration skills?

Show artifacts: an evaluation report you wrote, a model versioning decision, or a confusion matrix analysis that changed a product decision. Saying "we accepted 3% more false positives to reduce re-upload rate by 40%" beats any ML certification.

How does function calling in Claude 3 and GPT-4 change the PM-ML dynamic?

Tool use reduces the need for custom models in some scenarios. PMs increasingly spec prompt engineering and tool integration instead of model training. The collaboration practices still apply -- shared metrics, experimentation sprints, error analysis -- but artifacts look different: prompt templates instead of model architectures.

What is the biggest mistake new AI PMs make?

Treating ML development like feature development. In ML, 80% of the work might produce nothing if the approach is wrong. PMs who apply traditional project management systematically underestimate timelines and create pressure that leads to shipping undertested models.

Should PMs use AI coding tools to understand the ML codebase?

Yes. Using Cursor to read data pipelines and trace how data flows from input to prediction makes you a better collaborator. Two to three hours per month compounds into real architectural understanding.

Published October 12, 2024. Based on managing ML teams at a YC-backed tax-tech startup and a $40M insurance-tech company, 2021-2024.