Webinar in review: MLOps to reduce friction, technical debt, and modeling

Michael McCourt and Barrett Williams

Establishing a robust pipeline for successful models requires more than a Jupyter notebook and a GPU instance. It requires a standardized methodology that spans feature engineering, training and tuning, and serving and monitoring models in production.

This past Wednesday, SigOpt Head of Research, Michael McCourt, joined Kevin Stumpf, CTO of Tecton, and Eero Laaksonen, CEO of Valohai, to discuss the entire MLOps process, and how each company plays a strategic role in systematizing the modeling lifecycle. You can find the video recording of the panel discussion here.

Here’s a quick overview of the discussion:

  • SigOpt introduction on sample-efficient optimization and orchestration at scale (1:45)
  • Importance of data transformation and feature engineering (4:07)
  • MLOps platform for exploration, orchestration, and versioning in production (6:31)
  • Using Valohai for the NeurIPS optimization competition (6:38)
  • Best practices for data, standardization, and performance (8:52)
  • Artifacts required for ML (9:30)
  • Data catalog, permissions, legal constraints, garbage rows, aggregations (11:20)
  • Query latency versus fast serving requirements; the genesis of train-serve skew (13:47)
  • In larger companies, a hand-off is required to bring a model to production; differing agendas and delays (14:38)
  • Summary of the Valohai survey that gathered data on the maturity of the systems in place for various ML teams; advanced teams acknowledged the flow of data (15:07)
  • Pros and cons: SigOpt intentionally doesn’t access customer training data (16:48)
  • Model structure, origin, and other metadata can be as important as training data (19:51)
  • Technical debt between research and production, which Google Research has described as a “high-interest credit card” (22:11)
  • Different processes, still, for structured and unstructured data (24:51)
  • Data warehouses (and Spark) are still better suited for structured and semi-structured data, whereas data lakes house unstructured data (26:55)
  • The MLOps challenge for larger companies is hiring; why not abstract this function out to a third party? (30:09)
  • As you adjust feature definitions, or accuracy changes in production, you should trigger new ETL work automatically with orchestration software, compare with historical features (32:21)
  • Speed and consistency matter, particularly around documentation for each model and dataset, and the importance of parallelizing your workflow to ensure speed (35:25)
  • 6-12 months won’t work for most teams, and, as a result, you need several orders of magnitude of speed-up (39:14)
  • The hero coder role and the old ways of working are both dead. Training times of multiple weeks are not a good idea (41:01)
  • On using shortcuts versus just speeding up the best practice process itself: An example from the field was a customer who held most parameters constant while adjusting one; it worked, sort of, but Bayesian optimization across all parameters worked better (42:31)
  • The old workflow: one data scientist handing off Python code to a Java DevOps engineer for production (45:07)
  • Case study from a user: $20,000 wasted over a weekend for no training! (46:02)
  • It’s hard to spin up a GPU cluster from Jenkins; CI/CD concepts are relevant but new tools are necessary (47:20)
  • Standardization: where to allow flexibility, but also enable the organization to grow? Go from hero developers into exponential teams (50:45)
  • Technical debt via churn of data scientists, and the need to address this with standardization (53:25)
  • Reduce flexibility by 10%, with a model zoo, and get your model into production orders of magnitude faster (55:45)
  • Can new employees reproduce your work? Or is one-off code too chaotic to even use again? (58:20)
  • Performance in production: it’s important to define, agree on, and then optimize against metrics (59:52)
  • Early in the process, a junior data scientist builds lots of models; now he asks management why (1:01:23)
  • Once you have a robust process, the data often breaks rather than the model, as the world changes (1:03:00)
  • You need feature monitoring to see issues in subpopulations (Finland versus US, for example) (1:04:45)
  • Recommended techniques to address model or data drift: you need the right modeling tools, and it’s not trivial to trigger retraining (1:08:35)
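
To make the feature-monitoring point from the list above (issues in subpopulations, such as Finland versus the US) a little more concrete, here is a minimal Python sketch. It is not code from the webinar or from any of the three products; the function names, the `basket_value` feature, the sample data, and the 20% relative tolerance are all hypothetical choices made for illustration. It compares the per-group mean of one feature between a reference window and a live window, and shows how a shift in a small segment can stay hidden in the overall average.

```python
from collections import defaultdict
from statistics import mean


def group_means(rows, group_key, feature):
    """Average one feature per subpopulation (for example, per country)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[group_key]].append(row[feature])
    return {group: mean(values) for group, values in buckets.items()}


def drifted_groups(reference, live, group_key, feature, rel_tol=0.2):
    """Return groups whose live mean moved more than rel_tol from the reference mean."""
    ref = group_means(reference, group_key, feature)
    cur = group_means(live, group_key, feature)
    flagged = {}
    for group, ref_mean in ref.items():
        if group not in cur or ref_mean == 0:
            continue  # this simple sketch skips groups it cannot compare
        change = abs(cur[group] - ref_mean) / abs(ref_mean)
        if change > rel_tol:
            flagged[group] = round(change, 3)
    return flagged


# Hypothetical data: mostly US traffic plus a small Finnish segment.
reference = ([{"country": "US", "basket_value": 50.0}] * 90
             + [{"country": "FI", "basket_value": 40.0}] * 10)
live = ([{"country": "US", "basket_value": 50.0}] * 90
        + [{"country": "FI", "basket_value": 20.0}] * 10)

# Relative change of the overall mean, ignoring subpopulations.
ref_overall = mean([row["basket_value"] for row in reference])
live_overall = mean([row["basket_value"] for row in live])
overall_change = abs(live_overall - ref_overall) / ref_overall

flagged = drifted_groups(reference, live, "country", "basket_value")
print(f"overall change: {overall_change:.3f}")
print(f"flagged groups: {flagged}")
```

In this toy data the overall mean moves by about 4%, well under the tolerance, while the Finnish segment's mean halves and is the only group flagged. A production monitor would likely compare full distributions rather than means and would feed the result into whatever retraining trigger the team uses.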

If you’d like to read the full eBook, authored by SigOpt, Valohai, and Tecton, you can find it here. If you’d like to try your hand at optimizing, tracking, and orchestrating your model training and tuning with SigOpt, be sure to sign up here.

Michael McCourt Research Engineer
Barrett Williams Product Marketing Lead