ML Project Template


Tracking and documenting an ongoing machine learning project is a task unto itself. The following is a starting document covering all the moving parts involved in a machine learning project, specialized towards a specific application such as a recommendation system or a ranking capability. A simple forkable copy of this document is available at https://github.com/sengopal/ml-project-template as well.

Model and Research Documentation Template

Project Root

This document acts as the index README or the landing page for the machine learning project. The intent is to capture all the necessary decision-making information and related references in one centralized document.

Versions

  • <next version> - <what change happened and which section>
  • v1.0 - created on Sept 30, 2023

Audience

A simple list of interested folks. This section can also use a RASCI (Responsible, Accountable, Supporting, Consulted and Informed) structure if necessary, and can include external team stakeholders as well.

1 Goals and Definitions

1.1 Business Objectives

1.2 Vision

1.3 Impact Metrics

These are the output business metrics targeted for improvement. For a recommendation system these might be Null & Low queries, MRR, conversion, etc. It is important to identify these metrics, even though the models may not be directly optimized for them. These are not the model metrics such as precision or recall, which are tracked in the model training section.
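
For example, MRR can be computed directly from the ranked results and the known relevant items. The sketch below is a minimal, generic implementation; the function name and sample data are illustrative, not part of the template.

```python
def mean_reciprocal_rank(ranked_results, relevant_items):
    """MRR across queries: average of 1/rank of the first relevant result (0 if none)."""
    reciprocal_ranks = []
    for results, relevant in zip(ranked_results, relevant_items):
        rr = 0.0
        for rank, item in enumerate(results, start=1):
            if item in relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Illustrative data: first relevant hit at rank 2 for query 1 and rank 1 for query 2
print(mean_reciprocal_rank([["a", "b", "c"], ["x", "y"]], [{"b"}, {"x"}]))  # 0.75
```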

1.4 Project Scope

This section defines the scope of execution impact, such as mobile/desktop, geographies planned, experiments identified, user segments targeted, etc.

1.5 Usecases

Applications and usecases identified to utilize this feature/model and the method of consumption.

1.6 Opportunity Sizing Analysis

This section captures the opportunity sizing for each usecase planned. This identifies the approximate improvements in the input and output metrics with reasonable assumptions.
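
A back-of-the-envelope sizing can be captured alongside the narrative. The sketch below is purely illustrative; every number is a placeholder assumption to be replaced by the project's own estimates.

```python
# All values are placeholder assumptions for illustration only.
monthly_searches = 10_000_000      # assumed traffic in the targeted segment
null_low_rate = 0.08               # assumed share of Null & Low queries
expected_recovery = 0.25           # assumed fraction the new model can recover
conversion_rate = 0.02             # assumed conversion rate on recovered queries

recovered_queries = monthly_searches * null_low_rate * expected_recovery
incremental_conversions = recovered_queries * conversion_rate
print(f"Recovered queries/month: {recovered_queries:,.0f}")              # 200,000
print(f"Incremental conversions/month: {incremental_conversions:,.0f}")  # 4,000
```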

1.7 Current Baseline

This section describes the current state of the opportunity output metrics identified above. These act as the baseline against which model improvements and other hypotheses are measured and experimented.

1.8 Data reporting/business intelligence dashboards

2 Data Analysis

2.1 Data used

Location of the training/validation/test data, data freshness, SQLs/Hadoop jobs used to create the data.

2.2 Data Analysis

Exploratory analysis of the data being used - their distributions, any missing data, biases and methods to prevent them. This section also documents any interesting relationships observed in the data.
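
A minimal pandas sketch of the kind of checks this section summarizes, assuming a hypothetical Parquet dataset and column names:

```python
import pandas as pd

df = pd.read_parquet("data/train.parquet")            # hypothetical dataset location

print(df.describe(include="all"))                      # per-column distributions
print(df.isna().mean().sort_values(ascending=False))   # fraction of missing values per column
# hypothetical column: check for user-segment imbalance that could bias the model
print(df["user_segment"].value_counts(normalize=True))
```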

2.3 Data loading Jobs

ETL jobs for data loading and related transformations/conversions

2.4 Labeled Data / Human annotated datasets

Human annotated datasets that act as golden datasets for final model performance evaluation - their exploratory analysis and locations

3 Machine Learning Model

3.1 Baselines for model performance

These are the baselines established for model finetuning, based either on off-the-shelf model weights or on other reference models used as a proxy for the downstream tasks.

3.2 Literature review

This section captures any literature review performed to determine the model variations to be experimented with, their related notebooks, etc.

3.3 Model

<X> indicates the model variation tracking. These might be either numbered or a simple identifier can be used as well.

Model architecture, pretrained weights used, finetuning dataset used (refer to section 2.1).

  1. Find a SoTA model for your problem domain (if available) and reproduce its results, then apply it to your dataset as a second baseline.
  2. Track training methods and metrics (losses, epochs, etc.)
  3. MLflow or CometML for training and hyperparameter tracking (a minimal MLflow sketch follows this list)
  4. Model checkpoint locations
  5. Model training and inference timings
  6. Hardware configuration used
  7. Dependencies/Libraries - requirements.txt or a docker image
  8. Performance vs. latency tradeoffs
  9. Model export formats and comparison metrics (e.g., ONNX or TF-protobuf)
  10. Training improvements (quantization, smaller models or dimensions) and comparison metrics
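
A minimal MLflow tracking sketch, assuming a local MLflow setup; the experiment name, parameters and artifact paths are placeholders:

```python
import mlflow

mlflow.set_experiment("reco-ranker")                  # hypothetical experiment name
with mlflow.start_run():
    mlflow.log_params({"lr": 1e-4, "epochs": 3, "batch_size": 256})
    for epoch in range(3):
        train_loss = 1.0 / (epoch + 1)                # placeholder for the real training loss
        mlflow.log_metric("train_loss", train_loss, step=epoch)
    mlflow.log_artifact("requirements.txt")           # attach dependency listing to the run
```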

3.4 Model Evaluation

  1. Evaluation Metrics - training, validation and testing
  2. Experiments conducted and results
  3. Hyperparameter experiments and final parameters identified
  4. Streamlit / Gradio Demos (a minimal Gradio sketch follows this list)
  5. Model Card - https://modelcards.withgoogle.com/face-detection
  6. Github location for the evaluation notebooks
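
A minimal Gradio sketch for the demo item above; `score_query` and its outputs are placeholders for the project's actual inference call:

```python
import gradio as gr

def score_query(query: str) -> dict:
    # placeholder scores; in practice this would call the trained model
    return {"relevant": 0.7, "not_relevant": 0.3}

demo = gr.Interface(fn=score_query, inputs=gr.Textbox(label="Query"),
                    outputs=gr.Label(num_top_classes=2))
demo.launch()  # local, shareable demo for reviewers
```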

3.5 Inference Deployments

PyTorch code format and styleguide - https://github.com/IgorSusmelj/pytorch-styleguide#recommended-code-structure-for-training-your-model
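
To complement the styleguide link, a minimal PyTorch inference sketch; the model class, feature size and checkpoint path are placeholder assumptions:

```python
import torch
import torch.nn as nn

class RankingHead(nn.Module):
    """Tiny placeholder model so the sketch is self-contained."""
    def __init__(self, num_features: int = 16):
        super().__init__()
        self.linear = nn.Linear(num_features, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x).squeeze(-1)

model = RankingHead()
# In a real deployment the weights come from the tracked checkpoint, e.g.:
# model.load_state_dict(torch.load("checkpoints/best.pt", map_location="cpu"))
model.eval()

with torch.inference_mode():                # no autograd bookkeeping at inference time
    scores = model(torch.randn(4, 16))      # placeholder batch of 4 items
print(scores)
```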

4 ML Operations

This section documents the ML operation pipelines once the model has been identified. Create a sub-section for each of the items below and track the information listed.

  1. Data pipeline architecture - landing page to understand the data flow and dependencies
  2. Infrastructure diagrams
    • for offline batch inference - kafka topics, downstream identifiers, capacity estimates, frequency of updates
    • for online inference - APIs, platform used, capacity (throughput and latency) and cluster size
  3. Other Integration specific system dependencies
  4. Various modes of operation and specifics - Batch (Offline), Batch (Online), Realtime
  5. Airflow, Luigi or any other orchestration (a minimal Airflow DAG sketch follows this list)
  6. Any additional post processing - vector databases, indexing, quantization, etc.
  7. Code changes and Deployments - Source code, K8s pods
  8. Instructions for retraining, bulk inferencing etc.,
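
A minimal Airflow sketch of a retraining pipeline, referenced from item 5 above; the DAG id, schedule and task commands are placeholder assumptions:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical weekly retraining pipeline; commands and paths are placeholders.
with DAG(
    dag_id="reco_model_retrain",
    start_date=datetime(2023, 10, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract_training_data",
                           bash_command="python etl/extract.py")
    train = BashOperator(task_id="train_model",
                         bash_command="python train.py --config configs/prod.yaml")
    evaluate = BashOperator(task_id="evaluate_model",
                            bash_command="python evaluate.py")
    extract >> train >> evaluate
```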

5 Project Execution and Rollout Plan

Ideally this document should not include execution status, which is updated very frequently and represents the current state of affairs; there are other project management tools better suited for that.

This section tracks the intended end state of the model execution and can also track incremental phases. For each phase: timelines, A/B tests, dependency timelines, GitHub links, inference endpoints, cURL commands, notebook links, Jenkins URL, architecture diagrams and a reference to the model (refer to 3.3).
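
A minimal Python sketch of exercising an online inference endpoint (the programmatic equivalent of the cURL commands above); the URL and payload schema are placeholder assumptions:

```python
import requests

resp = requests.post(
    "https://ml-serving.example.com/v1/rank",   # placeholder endpoint URL
    json={"query": "running shoes", "user_id": "u123", "top_k": 10},
    timeout=2.0,
)
resp.raise_for_status()
print(resp.json())
```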

6 Outcomes and Monitoring

This section documents the outcomes and results, including the experiments, the variants tested and the observations. There should be a sub-section for the results of each A/B test variant, its impact and the guardrail metrics. This section should also document the observations, further model inputs, and clearly indicate the outcome of each experiment.

6.1 Monitoring

This section tracks the links to monitor the deployed system and data health.

  1. System Monitoring - for throughput, system health, response time, etc.
  2. Data Monitoring - coverage, data drift, model metrics, etc. (a minimal drift-check sketch follows this list)
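
A minimal data-drift check sketch using the Population Stability Index (PSI); the baseline/production samples and the 0.2 alert threshold are illustrative assumptions:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline (expected) and a production (actual) distribution."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)          # avoid log(0) / division by zero
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# Illustrative synthetic samples; in practice compare training data vs. live traffic.
baseline = np.random.normal(0.0, 1.0, 10_000)
production = np.random.normal(0.3, 1.1, 10_000)
psi = population_stability_index(baseline, production)
print(psi, "drift alert" if psi > 0.2 else "ok")   # 0.2 is a common rule-of-thumb threshold
```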

6.2 Playbook for FAQs and commonly known issues

7 References

All other reference links, such as:

  1. Internal documents
  2. Refs to wiki, screenshots, repos with any sample code
  3. External - inspiring work, papers for further literature review