Getting a Machine Learning Model Out of a Notebook and Into Production

Aetherisys Consulting #machine-learning #mlops #production

A working notebook is a proof of concept, not a product. It tells you the signal exists and the maths holds. It says nothing about whether the model can be retrained next month, whether it will return a prediction in 40 milliseconds, or whether anyone will notice when it quietly starts being wrong.

That gap — between a model that works on your laptop and one that works in production — is where most data science effort goes to die. Here is how to close it without pretending the hard parts away.

The notebook is not the deliverable

The first move is psychological. The notebook is a record of an experiment. The deliverable is a system: a way to train the model reproducibly, a way to serve its predictions, and a way to know when it breaks.

Stop editing the notebook. Pull the logic into plain Python modules — a features.py, a train.py, a predict.py. Notebook cells encourage hidden state and out-of-order execution; modules force you to make data flow explicit. Once the logic is in functions with typed inputs, you can test it, import it, and run it on a schedule. The notebook becomes what it should have been all along: a place to explore, not a place to ship from.
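As a minimal sketch of what that extraction might look like, here is a hypothetical features.py function with typed inputs and a plain assert-based test. The Order type and the order_value_features name are illustrative assumptions, not taken from any particular codebase.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Order:
    """One input row; a hypothetical schema used only for illustration."""
    customer_id: str
    amount: float


def order_value_features(orders: list[Order]) -> dict[str, float]:
    """Pure function: same orders in, same features out, no hidden state."""
    amounts = [o.amount for o in orders]
    return {
        "order_count": float(len(amounts)),
        "mean_order_value": mean(amounts) if amounts else 0.0,
        "max_order_value": max(amounts, default=0.0),
    }


def test_order_value_features() -> None:
    """Runs under pytest or by calling it directly."""
    orders = [Order("c1", 10.0), Order("c1", 30.0)]
    assert order_value_features(orders) == {
        "order_count": 2.0,
        "mean_order_value": 20.0,
        "max_order_value": 30.0,
    }
    # Empty input is handled explicitly rather than raising.
    assert order_value_features([])["mean_order_value"] == 0.0
```

Because the function takes its data as an argument instead of reading a notebook global, the same code can be imported by a training script, a serving process, and a test runner.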

Reproducible training

If you cannot rebuild the exact model you have in production, you do not have a model — you have an artefact you got lucky with once.

Reproducible training means three things are pinned and versioned:

  • Data: the snapshot the model trained on, not “the table as it is today”. A query against a live database is not a dataset.
  • Code: a specific commit, including the feature transformations.
  • Environment: locked dependency versions. With unpinned packages in requirements.txt, retraining in six months can silently pull in newer library versions and produce a different model.

Capture the training run as an experiment: parameters, metrics, and the resulting model file, all tied to that data snapshot and commit. Tools like MLflow or DVC handle this, but the discipline matters more than the tool. The test is simple — can a colleague reproduce your numbers from a clean checkout? If not, fix that before anything else.
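If you are not yet using a tracking tool, the same discipline can be approximated with a small manifest file written at the end of every training run. The sketch below uses only the standard library; write_run_manifest and its field names are assumptions made for illustration, and it does not reflect how MLflow or DVC store runs.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Fingerprint a file so the exact snapshot can be identified later."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_run_manifest(
    out_path: Path,
    data_path: Path,
    git_commit: str,
    params: dict,
    metrics: dict,
    model_path: Path,
) -> dict:
    """Record what is needed to rebuild this model: data, code, parameters, result."""
    manifest = {
        "data_sha256": file_sha256(data_path),
        "git_commit": git_commit,  # for example, the output of `git rev-parse HEAD`
        "params": params,
        "metrics": metrics,
        "model_sha256": file_sha256(model_path),
    }
    out_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest
```

The hash of the dependency lock file could be added the same way, which would tie the environment to the run alongside the data and the commit.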

Feature pipelines and the training–serving gap

The most common production failure is not a bad model. It is a feature computed one way during training and a different way during serving.

In the notebook you might do a groupby over the full history to build a “customer 30-day average”. At serving time that history is not sitting in a dataframe — it is in a database, and you need the value for one customer, now. If the two code paths diverge, the model sees inputs it was never trained on, and accuracy drops with no error in the logs.

The fix is to define each feature transformation once and call it from both paths. For real-time models this is what a feature store gives you. For many businesses it is enough to put the transformations in a shared module and be ruthless about both sides importing it. Either way: one definition, two callers.
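Here is one way the “one definition, two callers” idea could look for the 30-day average. The function names, the (date, amount) tuple layout, and the fetch_transactions lookup are all hypothetical; the point is only that both paths call the same avg_spend_30d.

```python
from datetime import date, timedelta
from statistics import mean


def avg_spend_30d(transactions: list[tuple[date, float]], as_of: date) -> float:
    """The single definition: mean amount over the 30 days before `as_of`.

    `transactions` holds (date, amount) pairs for ONE customer.
    """
    window_start = as_of - timedelta(days=30)
    recent = [amount for day, amount in transactions if window_start <= day < as_of]
    return mean(recent) if recent else 0.0


def build_training_rows(history: dict, label_dates: dict) -> dict:
    """Training caller: `history` maps customer_id -> transactions and
    `label_dates` maps customer_id -> the date the label was observed."""
    return {
        customer_id: avg_spend_30d(history[customer_id], as_of)
        for customer_id, as_of in label_dates.items()
    }


def build_serving_features(fetch_transactions, customer_id: str, today: date) -> dict:
    """Serving caller: `fetch_transactions` stands in for a database lookup
    that returns (date, amount) pairs for one customer."""
    return {"avg_spend_30d": avg_spend_30d(fetch_transactions(customer_id), today)}
```

Passing as_of explicitly is a deliberate choice in this sketch: the training path computes the feature as of the label date, so it cannot accidentally use transactions from the future that would not exist at serving time.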

Serving: batch or real-time

Decide deliberately. The two modes have very different costs.

Batch suits predictions that do not need to be instant — churn scores, lead rankings, nightly forecasts. You run a scheduled job, write predictions to a table, and the application reads them. It is dramatically simpler: no latency budget, no always-on service, easy to retry. Most “we need ML” problems are actually batch problems. Reach for it first.
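A batch job can be very small. The sketch below assumes a customer_features table already exists and uses SQLite only so the example is self-contained; the table names, column names, and MODEL_VERSION value are hypothetical, and the score argument stands in for whatever model you load.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Callable

MODEL_VERSION = "2024-05-01-a"  # hypothetical version label


def run_batch_scoring(
    conn: sqlite3.Connection, score: Callable[[float, float], float]
) -> int:
    """Score every row in `customer_features` and upsert into `predictions`.

    Rows are keyed on (customer_id, model_version), so a retried job
    overwrites its own output instead of duplicating it.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS predictions (
               customer_id TEXT, model_version TEXT, score REAL, scored_at TEXT,
               PRIMARY KEY (customer_id, model_version))"""
    )
    rows = conn.execute(
        "SELECT customer_id, avg_spend_30d, days_since_login FROM customer_features"
    ).fetchall()
    scored_at = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT OR REPLACE INTO predictions VALUES (?, ?, ?, ?)",
        [(cid, MODEL_VERSION, score(spend, days), scored_at) for cid, spend, days in rows],
    )
    conn.commit()
    return len(rows)
```

Run on a schedule, this is the whole serving layer: the application reads the predictions table and never calls the model directly.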

Real-time suits predictions tied to a live user action — fraud checks, search ranking, recommendations. Here you wrap the model in a service behind an HTTP or gRPC endpoint:

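# Assumes a FastAPI app, pydantic models PredictRequest and PredictResponse,
# and a model and MODEL_VERSION loaded once at startup.
@app.post("/predict")
def predict(payload: PredictRequest) -> PredictResponse:
    features = build_features(payload)   # the shared transformation
    score = model.predict(features)
    return PredictResponse(score=score, model_version=MODEL_VERSION)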

Return the model version with every prediction. When something looks wrong in three weeks, you will want to know exactly which model produced which output.

Monitoring for drift

A deployed model degrades. The world it was trained on moves on, and nothing in your stack will tell you unless you ask.

Monitor three layers:

  • Operational: latency, error rate, throughput. Standard service health.
  • Input drift: has the distribution of incoming features shifted away from the training data? This is the early warning, because it is visible as soon as inputs arrive, before a drop in accuracy can be measured.
  • Prediction quality: once ground truth arrives (a customer churned or did not, a transaction was fraud or not), compare it against what the model said.
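For the input drift layer, one commonly used check is the population stability index (PSI), which compares the binned distribution of a feature in training against the same feature in live traffic. The sketch below is one simple way to compute it, not the only one. The equal-width binning and the 1e-6 floor for empty bins are assumptions of this example, and both samples are assumed to be non-empty.

```python
import math


def population_stability_index(expected, actual, n_bins: int = 10) -> float:
    """PSI between a training sample (`expected`) and a live sample (`actual`).

    Bins are equal-width over the training range; live values outside that
    range fall into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A frequently quoted rule of thumb treats values below roughly 0.1 as stable and values above roughly 0.25 as a significant shift, but those thresholds are conventions. Calibrate them against how your own features behave week to week.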

Ground truth often lags by days or weeks. Build the pipeline that joins predictions to outcomes anyway — it is the only honest measure of whether the model still earns its place.
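The join itself does not need to be elaborate. As a sketch, assuming predictions are logged as (entity_id, model_version, score, predicted_on) tuples and outcomes arrive later as a lookup by entity, the function below reports precision per model version and counts the predictions still waiting for ground truth. The record layout and the 0.5 threshold are hypothetical.

```python
from datetime import date


def precision_by_model_version(predictions, outcomes, threshold: float = 0.5) -> dict:
    """Join logged predictions to later outcomes and score each model version.

    `predictions`: iterable of (entity_id, model_version, score, predicted_on)
    `outcomes`: dict mapping entity_id -> (actual_label, observed_on)
    Predictions with no outcome yet are counted as pending, not dropped.
    """
    stats: dict = {}
    for entity_id, version, score, predicted_on in predictions:
        s = stats.setdefault(version, {"flagged": 0, "correct": 0, "pending": 0})
        outcome = outcomes.get(entity_id)
        if outcome is None or outcome[1] < predicted_on:
            s["pending"] += 1  # no ground truth observed after this prediction
            continue
        if score >= threshold:
            s["flagged"] += 1
            s["correct"] += int(outcome[0])
    return {
        version: {
            "precision": s["correct"] / s["flagged"] if s["flagged"] else None,
            "pending": s["pending"],
        }
        for version, s in stats.items()
    }
```

Reporting the pending count next to the metric makes the lag visible: a precision figure built on a handful of resolved predictions should be read differently from one built on thousands.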

Evaluating in production

Offline metrics are a hypothesis. Production is the test. Before a model influences real decisions, run it in shadow mode — making predictions on live traffic, logging them, affecting nothing. You will catch latency surprises and feature mismatches with no business risk.
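In code, shadow mode can be as simple as a wrapper that always returns the live model's answer and only logs the candidate's. This is a minimal sketch in which both models are assumed to be plain callables taking a feature dict; predict_with_shadow is a hypothetical name.

```python
import logging
import time

logger = logging.getLogger("shadow")


def predict_with_shadow(features, live_model, shadow_model):
    """Return the live model's prediction; run the shadow model for logging only."""
    live_score = live_model(features)
    try:
        started = time.perf_counter()
        shadow_score = shadow_model(features)
        elapsed_ms = (time.perf_counter() - started) * 1000
        logger.info(
            "shadow live=%s shadow=%s shadow_ms=%.1f", live_score, shadow_score, elapsed_ms
        )
    except Exception:
        # A failing shadow model must never affect the caller.
        logger.exception("shadow model failed")
    return live_score
```

This version runs the shadow call inline, so it adds the candidate's latency to every request. In a latency-sensitive service you would more likely push the shadow call onto a queue or background task.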

When you do go live, do it as a controlled rollout: a small percentage of traffic, an A/B test against the existing approach, with a clear metric and a clean rollback. “We replaced the model on Tuesday” is not a deployment strategy.
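One common way to implement the traffic split is to hash a stable identifier into a bucket, so that each user's assignment is deterministic and the rollout percentage lives in a single number. The sketch below assumes a string user ID and a named experiment; assign_variant is a hypothetical helper, not part of any library.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float) -> str:
    """Deterministically place a user in 'treatment' or 'control'.

    Hashing (experiment, user_id) gives a stable bucket in [0, 1), so the same
    user always sees the same variant and raising the share only adds users.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # first 32 bits mapped to [0, 1)
    return "treatment" if bucket < treatment_share else "control"
```

With this design, rollback is setting treatment_share to 0.0, and widening the rollout from 5 to 20 per cent keeps every user who was already in the treatment group in it.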

When not to use ML

The most senior decision is sometimes to not ship a model at all.

If a handful of business rules get you 90 per cent of the value, ship the rules. They are transparent, debuggable, and need no monitoring infrastructure. If you cannot get reliable labels, or the cost of a wrong prediction is high and unaccountable, ML adds risk without adding clarity. A model is a system you have to feed, watch and retrain — that ongoing cost is only worth paying when the problem genuinely needs a learned function.

The shortest honest path

Getting a model out of a notebook is mostly engineering, not modelling: extract the logic into tested modules, make training reproducible, share feature code across training and serving, pick batch unless you genuinely need real-time, and monitor for drift from day one. None of it is glamorous. All of it is what separates a model that ships from a notebook that impresses in a meeting.

This is the work our data science and ML practice does — taking models that work and making them production systems that keep working.

If you have a model stuck in a notebook and no clear route to production, get in touch. We will help you find the shortest honest path.