
Resources

Property Resources

Properties are characteristics of the trained model, the procedure used to train it (including training data), or its ability to perform inference. This section gives more detail and background on each property within MLTE. The properties are organized into three categories:

  • Functionality
  • Robustness
  • Costs

Functionality

Task Efficacy

  • Objective: Assess the ability of the model to perform the task for which it is designed.
  • Metric: Select a task-appropriate model quality evaluation metric.
  • Rationale: Measuring efficacy on a critical task is important to project success.
  • Implementation: Select a task-appropriate implementation of your metric.

Fairness

  • Objective: Data and models should be free of bias to avoid unfair treatment of certain groups, to ensure a fair distribution of benefits and costs, and to offer those affected an opportunity to seek redress against adverse decisions made by the system or the humans operating it (Chouldechova & Roth 2018).
  • Metric: Statistical metrics of fairness include raw positive classification rate (Feldman et al. 2015), false positive and false negative rates, and positive predictive value (Chouldechova 2017). However, every fairness metric involves tradeoffs, so if fairness is important to the system, the model and system teams must discuss the overall effects and agree on the appropriate tradeoffs.
  • Rationale: Biased models result in a degraded user experience for certain sub-populations, and can damage user trust in a system.
  • Implementation: Start by identifying the protected attribute in your dataset, and then determine what fairness measure the model and system should prioritize.
Research on Fairness
Fairness Questions
  • Are subsets or groups within your dataset equally likely to be classified or predicted?
  • If your model is being used on demographic groups, does your predictor produce similar outputs for similar individuals across demographic groups (Gajane & Pechenizkiy 2018)?
  • If your model feeds into a socio-technical system, will it dynamically affect the environment and the incentives of human actors who interact with the system?
  • Is your dataset potentially biased or skewed in some way?
Considerations and Methods for Implementing Fairness
  • Consider using metrics of statistical fairness, in which a small number of protected demographic groups should have parity of some statistical measure across all groups, such as raw positive classification rate (Feldman et al. 2015), false positive and false negative rates, or positive predictive value (the latter two from Chouldechova 2017); a minimal sketch of computing these per-group rates follows this list.
  • Note that there are tradeoffs to individual versus statistical fairness, see Chouldechova & Roth 2018.
  • If there is a reliable and non-discriminating distance metric, see Gajane & Pechenizkiy's definition 4 for a test by which individual fairness can be measured.
  • Kannan et al. and Liu et al. demonstrate how to consider the dynamic effects of decisions on a system; using the context of your system, identify ways in which downstream effects might modify the social fabric and determine if those parts of the model or the system need to be modified accordingly.
  • Depending upon your knowledge of bias or skew in the data, consider using rank-preserving procedures for repairing features to reduce or remove pairwise dependence with the protected attribute from Feldman et al. 2015.
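
Below is a minimal sketch of the per-group statistics mentioned above, assuming binary labels, binary predictions, and a single protected attribute; the function name group_rates and the toy arrays are illustrative only.

import numpy as np

def group_rates(y_true, y_pred, protected):
    """Per-group fairness statistics for a binary classifier."""
    stats = {}
    for g in np.unique(protected):
        mask = protected == g
        t, p = y_true[mask], y_pred[mask]
        pos, neg = t == 1, t == 0
        stats[g] = {
            # Raw positive classification rate (Feldman et al. 2015)
            "positive_rate": float(p.mean()),
            # False positive / false negative rates and PPV (Chouldechova 2017)
            "fpr": float((p[neg] == 1).mean()) if neg.any() else float("nan"),
            "fnr": float((p[pos] == 0).mean()) if pos.any() else float("nan"),
            "ppv": float((t[p == 1] == 1).mean()) if (p == 1).any() else float("nan"),
        }
    return stats

# Toy example: compare the rates across the two groups to check for parity.
y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])
protected = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(group_rates(y_true, y_pred, protected))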

Interpretability

  • Objective: Some systems necessitate an ability to be explained or presented in human-understandable terms (Doshi-Velez & Kim 2017).
  • Metric: Interpretability is difficult to measure; it can be considered from an end-user or a developer perspective by observing and evaluating how those users interact with the system, or by having a domain expert explain model outputs in context (Doshi-Velez & Kim 2017).
  • Rationale: Depending on the system purpose, it may be critical for the system to be explainable and understandable.
  • Implementation: Options include, among others: intrinsic interpretability, in which the model is self-explanatory, or post-hoc interpretability, in which another model is created to explain the outputs of the first (Du et al. 2019).
Research on Interpretability
Interpretability Questions
  • Is it important that the model is explainable to the user? Some machine learning systems do not require explainability because “(1) there are no significant consequences for unacceptable results or (2) the problem is sufficiently well-studied and validated in real applications that we trust the system’s decision, even if the system is not perfect” (Doshi-Velez & Kim 2017).
  • Can interpretability be handled at a model-agnostic level by simply analyzing outputs with respect to their context?
Considerations and Methods for Implementing Interpretability
  • If interpretability is important, consider using intrinsic interpretability (in which the model is self-explanatory) or post-hoc interpretability (create another model to explain outputs from the first) from Du et al. 2019. A domain expert can also be called upon to explain model outputs in their proper context (Doshi-Velez & Kim 2017).
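
One common post-hoc approach is a global surrogate: train an interpretable model on the predictions of the opaque one. The sketch below uses scikit-learn with a synthetic dataset purely for illustration; substitute your own model and data.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in for an opaque model; any fitted classifier could be used here.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

# Post-hoc surrogate: train an interpretable model on the black box's outputs.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, black_box.predict(X))

# Fidelity: how often the surrogate agrees with the black box it explains.
fidelity = (surrogate.predict(X) == black_box.predict(X)).mean()
print(f"Surrogate fidelity: {fidelity:.2%}")
print(export_text(surrogate))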

Robustness

General Robustness Research

Capability Approach for Robustness in Computer Vision

  • Identify critical computer vision capabilities of the model to evaluate. See Ribeiro et al. for content on identifying capabilities and developing task tests. Favor the model that has best learned the most relevant capabilities. Computer vision capabilities to consider testing include:
    • Identifying shape
    • Robustness to altered texture
    • Robustness to novel backgrounds
    • Segmentation into regions
  • Teams can apply test types including minimum functionality, invariance, and directional expectation tests (see the invariance sketch after this list).
  • Teams can curate test data by mutating existing inputs, generating new inputs, or obtaining new inputs.
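
As an illustration of an invariance test, the sketch below applies a label-preserving perturbation and reports how often the model's prediction changes. The function names, the NHWC image layout, and the [0, 1] value range are assumptions; predict stands in for whatever inference interface the team uses.

import numpy as np

def invariance_failure_rate(predict, images, perturb):
    """Fraction of inputs whose prediction changes under a label-preserving perturbation."""
    original = predict(images)
    perturbed = predict(perturb(images))
    return float(np.mean(np.asarray(original) != np.asarray(perturbed)))

# Example label-preserving perturbations for images shaped (N, H, W, C) with values in [0, 1].
def horizontal_flip(images):
    return images[:, :, ::-1, :]

def brighten(images, delta=0.1):
    return np.clip(images + delta, 0.0, 1.0)

# failure_rate = invariance_failure_rate(model_predict, test_images, horizontal_flip)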

Robustness to Naturally Occurring Data Challenges

  • Objective: Ensure that the model is robust to naturally occurring data challenges that it will encounter in the ambient conditions of the system (Berghoff et al. 2021).
  • Metric: Depending on the identified data challenges and the task-specific properties, model robustness can be measured with a robustness score, i.e., the fraction of samples in the dataset that remain correctly classified under perturbation, computed across the perturbation parameter space. Reassessing model accuracy on augmented datasets is a common approach to measuring robustness (Berghoff et al. 2021).
  • Rationale: Models implemented in a system will experience common data challenges such as illumination changes, motion blur, occlusion, changes in perspective, and weather impacts. These perturbations affect the data and can significantly degrade the quality of model predictions, so they must be addressed before deployment (Russell & Norvig 2003).
  • Implementation: Dependent on the identified data challenges; the AutoAugment data augmentation policy evaluated in Yin et al. is a recommended starting point. Ribeiro et al. is also useful for identifying the capabilities the model needs in order to be robust.
Research on Robustness to Naturally Occurring Data Challenges
  • For methods of addressing the naturally occurring data challenges that might arise from the ambient conditions of the system, see Robustness Testing of AI Systems.
Questions on Robustness to Naturally Occurring Data Challenges
  • What data challenges may occur when your model is deployed to the system? Could your model face:
    • Significant changes in illumination or color transformations (brightness, contrast, saturation, hue, grayscale, color-depth)?
    • Motion blur or other pixel perturbations?
    • Occlusion of the target object?
    • Changes in perspective (rotation, translation, scaling, shearing, blurring, sharpening, flipping)?
    • Weather impacts? (Russell & Norvig 2003, Ch. 25)
    • Other system specific conditions? (For example, stickers on objects or damaged objects)
  • What are the typical and atypical system conditions in which your model will be deployed?
  • How may data collection processes or physical sensors be degraded with time, use, or damage?
  • Are there any extreme distribution shifts or long tail events that could cause large accuracy drops? (Hendrycks et al. 2021)
Considerations and Methods for Implementing Robustness to Naturally Occurring Data Challenges
  • If there are known specific data challenges the model will face, consider prioritizing robustness to those perturbations. (For example, Gaussian data augmentation improves robustness to noise and blurring but degrades performance on fog and contrast (Yin et al.).) Generate a list of task-specific properties and plot the model's robustness score (the fraction of samples that remain correctly classified) across the perturbation parameter space (Berghoff et al. 2021); see the sketch after this list.
  • Otherwise, to achieve the most generally robust model, the AutoAugment data augmentation policy recommended in Yin et al. provides the most generalizable robustness across corruption types.
  • You might also consider tying into the system-level framework in order to build in feedback loops that could influence the environment.
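
A minimal sketch of the robustness-score sweep described above, assuming images are float arrays in [0, 1] and predict is the team's own inference function; the Gaussian-noise perturbation and the parameter grid are illustrative.

import numpy as np

def robustness_score(predict, images, labels, perturb, param):
    """Fraction of samples that remain correctly classified after perturbation."""
    preds = predict(perturb(images, param))
    return float(np.mean(preds == labels))

def gaussian_noise(images, sigma):
    noisy = images + np.random.normal(0.0, sigma, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)

# Sweep the perturbation parameter space and record the score at each level.
# sigmas = [0.0, 0.05, 0.1, 0.2, 0.4]
# scores = [robustness_score(model_predict, x_test, y_test, gaussian_noise, s) for s in sigmas]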

Robustness to Adversarial Attack

  • Objective: Ensure that the model is robust to synthetic manipulation or targeted adversarial attacks (Hendrycks et al. and McGraw et al. 2020).
  • Metric: There are performance metrics for adversarial robustness (Buzhinsky et al. 2020) and existing benchmarked adversarial robustness tools such as CleverHans, Foolbox, and the Adversarial Robustness Toolbox (ART) that may be used.
  • Rationale: A model deployed in a system may face different vulnerabilities (data pollution, physical infrastructure, etc.) and attacks (poisoning, extraction, inference, etc.) that can significantly degrade the performance, security, or safety of the model.
  • Implementation: Approaches to implementing robustness to adversarial attack vary depending on which methods of attack are most likely and most detrimental for your system.
Questions on Robustness to Adversarial Attack
  • How is an adversary most likely going to attempt to break your model?
  • What would be the most dangerous method an adversary could use to break your model?
  • Did you consider different types and natures of vulnerabilities such as data pollution, physical infrastructure, and cyber-attacks? (Hendrycks et al. 2021)
  • What is the threat of evasion attacks, poisoning attacks, extraction attacks, and inference attacks, and does the model need to be prepared to address these? (ART)
  • Did you consult with the systems team to put measures in place to ensure integrity and resilience of the system against attacks?
Considerations and Methods for Implementing Robustness to Adversarial Attack
  • Consider using performance metrics for adversarial robustness like in Buzhinsky et al. 2020. Adversarial robustness in the latent space is the “resilience” to the worst-case noise additions. Metrics include local latent adversarial robustness, generation severity, reconstructive severity, and reconstructive accuracy.
  • Generate simulations for possible adversarial attacks to predict behavior in settings that preclude practical testing of the system itself (Pezzementi et al. 2021). Evaluate model performance with a metric like a robustness receiver operating characteristic (ROC) curve.
  • Consider using a benchmarked adversarial robustness tool like CleverHans, Foolbox, or ART; a minimal attack sketch follows this list.
  • If it would be beneficial to detect adversarial anomalies or assign low confidence values to potential adversarial inputs, that is something that should be tied into the system framework.
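
For illustration only, the sketch below evaluates accuracy under a simple fast gradient sign method (FGSM) attack, assuming a PyTorch classifier with inputs in [0, 1]; the tools listed above provide maintained implementations of this and far stronger attacks and should be preferred in practice.

import torch

def fgsm_attack(model, loss_fn, x, y, eps=0.03):
    """Generate FGSM adversarial examples for a batch (x, y) of inputs in [0, 1]."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + eps * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()

def adversarial_accuracy(model, loss_fn, x, y, eps=0.03):
    """Accuracy of the model on FGSM-perturbed copies of (x, y)."""
    model.eval()
    x_adv = fgsm_attack(model, loss_fn, x, y, eps)
    preds = model(x_adv).argmax(dim=1)
    return (preds == y).float().mean().item()

# Compare clean accuracy against accuracy under attack, e.g.:
# acc_adv = adversarial_accuracy(model, torch.nn.CrossEntropyLoss(), x_test, y_test)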

Robustness to Device-Generated Perturbations

  • Objective: Ensure that the model and the system are robust to perturbations resulting from devices that are part of the system. An example of a device-generated perturbation would be a camera taking unfocused video or pictures, making it impossible for the computer vision model to detect objects.
  • Metric: If sensor redundancy is determined to be necessary, establish a common representation of the input and evaluate the system with simulated sensor failures. If robustness to single-sensor noise is sufficient, determine the most likely sensor degradations and evaluate the mean average precision (mAP) on a simulated degraded dataset.
  • Rationale: Models are often evaluated with full sensor availability. However, in a safety-critical system, unexpected scenarios like sensor degradation or failure must be accounted for.
  • Implementation: Depending on the system in which the model will be deployed, an option is to implement sensor redundancy. An architecture that uses multiple sensors to perform object detection jointly can provide robustness to sensor failure (Berntsson & Tonderski 2019). Alternatively, if typical sensor degradation patterns are known or possible to predict, robustness tests specific to the sensor can be designed (Seals 2019). For example, evaluating the mAP of the model against speckle noise, salt and pepper noise, contrast alterations, or Gaussian noise can be used to determine robustness.
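
A minimal sketch of simulating two of the degradations mentioned above (salt-and-pepper and speckle noise), assuming images are float arrays in [0, 1]; the degraded copies are then fed through the team's existing mAP evaluation, which is not reproduced here.

import numpy as np

def salt_and_pepper(images, amount=0.02, rng=None):
    """Set a random fraction of pixels to black or white."""
    rng = rng or np.random.default_rng()
    noisy = images.copy()
    mask = rng.random(images.shape) < amount
    noisy[mask] = rng.integers(0, 2, size=int(mask.sum())).astype(images.dtype)
    return noisy

def speckle(images, sigma=0.1, rng=None):
    """Apply multiplicative (speckle) noise."""
    rng = rng or np.random.default_rng()
    return np.clip(images * (1.0 + rng.normal(0.0, sigma, images.shape)), 0.0, 1.0)

# degraded = speckle(test_images)  # then re-run the existing mAP evaluation on the degraded set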

Robustness to Synthetic Image Modifications

Addressing synthetic image modifications allows models to handle images that have been altered after capture (for instance, by applying a filter).

Questions on Robustness to Synthetic Image Modifications
  • Are there critical computer vision capabilities that would allow the model to generalize robustly beyond controlled training settings?
  • How may a human approach the task in the face of synthetic modifications? Is edge detection necessary to complete the task?
  • Is there existing knowledge or theories about the task that can be leveraged?
Capability Approach for Robustness to Synthetic Image Modifications
  • Identify critical computer vision capabilities of the model to evaluate. See Ribeiro et al. for content on identifying capabilities and developing task tests. Favor the model that has best learned the most relevant capabilities. Computer vision capabilities to consider testing include:
    • Identifying shape
    • Robustness to altered texture
    • Robustness to novel backgrounds
    • Segmentation into regions
  • Teams can apply test types including minimum functionality, invariance, and directional expectation tests.
  • Teams can curate test data by mutating existing inputs, generating new inputs, or obtaining new inputs.

Security

  • Objective: Ensure that the model is insulated against compromise arising from internal error.
  • Metric: The metric by which security is measured will depend on what risks are most likely for your given model. Areas of focus could include adversarial attacks (as described above), reproducibility, overfitting, and output integrity among others. See McGraw et al. for a comprehensive list of risks and recommended methods of addressing them.
  • Rationale: A model and the system in which it is encased have numerous risk areas that can be traced back to intrinsic design flaws (McGraw et al. 2020).
  • Implementation: Prioritize risks based on your model and system, and address them in order of probability that they occur.

Costs

Model Size (Static)

  • Objective: Measure the static size of a trained model.
  • Metric: The storage requirement for the model in bytes or some multiple thereof (e.g. kilobytes, megabytes, etc.). This metric is absolute.
  • Rationale: A model’s static size is its size at rest, when it is ready to perform inference. The static size of the model may limit the infrastructure on which it may be deployed.
  • Implementation: Measure the on-disk size of the model static format. The exact implementation may vary based on the development platform and environment available. Examples of potential implementations are provided below.

On UNIX-like systems, use the du ("disk usage") command to measure model size:

# For models stored statically as a single file
$ du --bytes model

# For models stored statically as a directory
$ du --bytes model/*

On Windows systems, the Explorer GUI displays file and directory size. Alternatively, use the following commands in a PowerShell interpreter to measure model size:

# For models stored statically as a single file
Get-Item model | Measure-Object -Property Length -Sum

# For models stored statically as a directory
Get-ChildItem model/ -Recurse -File | Measure-Object -Property Length -Sum

Programmatically measuring the file size may be the most useful when automating this procedure in an ML pipeline. MLTE provides functionality for measuring the size of models stored on the local filesystem:

from mlte.measurement import model_size

path = ...  # the path to the model on the local filesystem
size = model_size(path)
print(size)

Model Size (Dynamic)

  • Objective: Measure the dynamic size of a trained model in terms of its storage requirements.
  • Metric: The storage requirement for the model in bytes or some multiple thereof (e.g., kilobytes, megabytes, etc.). This metric is absolute.
  • Rationale: A model’s dynamic size is its size in a serialized form that is appropriate for transport (e.g. via removable media or over the network). The dynamic size of the model determines the difficulty (time requirement) of transporting the model. This concern manifests both internally during development of an automated training pipeline as well as externally during deployment. The dynamic size of a model may depend on the choice of serialization format, compression, and encryption, among other factors.
  • Implementation: Measure the on-disk size of the model dynamic format. The exact implementation may vary based on the development platform and environment available. Examples of potential implementations are provided below.

On UNIX-like systems, use the du ("disk usage") command to measure model size:

# For models serialized as a single file
$ du --bytes model

# For models serialized as a directory
$ du --bytes model/*

On Windows systems, the Explorer GUI displays file and directory size. Alternatively, use the following commands in a PowerShell interpreter to measure model size:

# For models serialized as a single file
Get-Item model | Measure-Object -Property Length -Sum

# For models serialized as a directory
Get-ChildItem model/ -Recurse -File | Measure-Object -Property Length -Sum

Programmatically measuring the file size may be the most useful when automating this procedure in an ML pipeline. MLTE provides functionality for measuring the size of models stored on the local filesystem:

from mlte.measurement import model_size

path = ...  # the path to the model on the local filesystem
size = model_size(path)
print(size)

Training Time

  • Objective: Measure the total time required to train the model.
  • Metric: The wall-clock time required to run the model training process in seconds or some multiple thereof (e.g. minutes, hours, etc.). This metric is relative.
  • Rationale: Training time is a critical constraint on the machine learning pipeline. Long training times limit the ability of the ML engineer to iterate on the model and make improvements during development. Long training times also limit the frequency with which new models may be deployed to production.
  • Implementation: The wall-clock time required to train a machine learning model is highly dependent upon the system on which training occurs. A system with better hardware properties (e.g. CPU cores, clock frequency, cache capacity, RAM capacity) trains faster than a weaker one. Whether or not a GPU is available, and the quality thereof, is another consideration. When the input dataset is large, storage system performance may become the bottleneck. For models that require distributed training, cluster properties confound these measurements. This variability necessitates a common benchmark infrastructure for model training time.
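
A minimal wall-clock timing sketch using only the Python standard library; train_model is a stand-in for the team's actual training entry point, and recording basic platform details keeps measurements comparable across machines.

import platform
import time

start = time.perf_counter()
# train_model(...)  # the training procedure under measurement
elapsed = time.perf_counter() - start

print(f"Training time: {elapsed:.1f} s on {platform.processor()} / {platform.platform()}")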

Training CPU Consumption

  • Objective: Measure the peak and average CPU utilization during model training.
  • Metric: The percentage of compute resources utilized by the training process as a percentage of the total compute available to the system on which it is evaluated. This metric is relative.
  • Rationale: The computational requirements of model training determine the load that it places on the system during the training procedure. Typically, we are not concerned with the efficiency of other jobs that run concurrently on the same machine during model training. Therefore, the peak and average CPU consumption of the training process are primarily relevant because they determine the resource requirements necessary to train efficiently. This metric is not directly applicable to a distributed training procedure.
  • Implementation: Measure the CPU utilization of the process running the training procedure. The measurement procedure will vary depending on the training environment.

Measure the CPU utilization of the training procedure with MLTE:

from mlte.monitoring import cpu_utilization

pid = ...  # identifier of the training process

stats = cpu_utilization(pid)
print(stats)

Training Memory Consumption

  • Objective: Measure the peak and average memory consumption during model training.
  • Metric: The volume of memory consumed in bytes or some multiple thereof (kilobytes, megabytes, etc.). This metric is absolute.
  • Rationale: The memory requirements of model training determine the load that it places on the system during the training procedure. Typically, we are not concerned with the efficiency of other jobs that run concurrently on the same machine during model training. Therefore, the peak and average memory consumption of the training process are primarily relevant because they determine the resource requirements necessary to train efficiently. This metric is not directly applicable to a distributed training procedure.
  • Implementation: Measure the memory consumption of the process running the training procedure. The measurement procedure will vary depending on the training environment.

Measure the memory consumption of the training procedure with MLTE:

from mlte.monitoring import memory_consumption

pid = ...  # identifier of the training process

stats = memory_consumption(pid)
print(stats)

Training Energy Consumption

  • Objective: Measure the energy consumption of the model training process.
  • Metric: The energy consumed by the training process in joules (total power consumption over a time interval).
  • Rationale: For large-scale machine learning applications, energy consumption may be a major driver in the total cost of development and maintenance. The model training process is frequently the most energy-intensive stage of the machine learning pipeline.
  • Implementation: Energy consumption and power requirements are a relatively new consideration in the field of machine learning. Accordingly, methods for convenient and accurate measurement are limited.
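
As one option on Linux hosts with Intel RAPL support, the package-level energy counter exposed through the powercap interface can be read before and after training. This is a rough sketch only: the sysfs path below is an assumption that varies by platform, reading it may require elevated permissions, and the counter wraps around on long runs.

RAPL_PATH = "/sys/class/powercap/intel-rapl:0/energy_uj"  # package 0 cumulative energy, in microjoules

def read_energy_uj(path=RAPL_PATH):
    with open(path) as f:
        return int(f.read().strip())

before = read_energy_uj()
# train_model(...)  # the training procedure under measurement
after = read_energy_uj()

print(f"Energy consumed: {(after - before) / 1e6:.1f} J")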

Inference Latency (Mean)

  • Objective: Measure the mean inference latency of a trained model.
  • Metric: The time required to complete a single inference request, in milliseconds. This metric is relative.
  • Rationale: Inference latency refers to the time required for a trained model to make a single prediction given some input data. While the machine learning model is likely only a small part of the intelligent system in which it is integrated, it may contribute substantially to the overall latency of the service.
  • Implementation: Measure the latency of the model across many inference requests and compute the mean. The measurement procedure will vary based on the development environment.

Measure the mean latency of model inference with MLTE:

from mlte.measurement import mean_latency

model = ...  # trained model that implements __call__()
d_gen = ...  # input generator that implements __call__()

latency = mean_latency(model, d_gen)
print(f"Mean latency: {latency}ms")

Inference Latency (Tail)

  • Objective: Measure the tail inference latency of a trained model.
  • Metric: The time required to complete a single inference request, in milliseconds. This metric is relative.
  • Rationale: Tail latency refers to the latency of model inference at the (right) tail of the latency distribution. In many production environments, mean latency does not adequately reflect the production viability of a model in terms of its runtime requirements. Instead, tail latency provides a more informative measure of the guarantees we can provide about model runtime performance.
  • Implementation: Measure the latency of the model across many inference requests and compute the desired tail percentile. The measurement procedure will vary based on the development environment.

Measure the tail latency of model inference with MLTE. By default, the tail_latency() function computes the 99th percentile latency, but this value may be changed via a keyword argument.

from mlte.measurement import tail_latency

model = ...  # trained model that implements __call__()
d_gen = ...  # input generator that implements __call__()

latency = tail_latency(model, d_gen)
print(f"Tail latency: {latency}ms")

Inference Throughput

  • Objective: Measure the inference throughput of a trained model.
  • Metric: The number of inference requests completed in one second. This metric is relative.
  • Rationale: For some applications, service throughput is a more important metric than service latency. In such cases, we may be unconcerned with the latency of inference requests to the model and more concerned with its throughput.
  • Implementation: Measure the throughput of the model by providing it with a stream of many inference requests, computing the time required to complete all of these requests, and dividing the number of completed requests by this duration. The measurement procedure will vary based on the development environment.

Measure the throughput of model inference with MLTE:

from mlte.measurement import throughput

model = ...  # trained model that implements __call__()
d_gen = ...  # input generator that implements __call__()

t_put = throughput(model, d_gen)
print(f"Throughput: {t_put} requests per second")

Inference CPU Consumption

  • Objective: Measure the peak and average CPU utilization during model inference.
  • Metric: The percentage of compute resources utilized by the inference service as a percentage of the total compute available to the system on which it is evaluated. This metric is relative.
  • Rationale: The computational requirements of model inference determine the load that it places on the system when performing inference. This is a key determinant in the compute resources required for model deployment. For example, a model for which inference is computationally inexpensive may be deployed to an instance with relatively light computational resources. This might allow for investment in other resources, such as memory capacity, for the instance to which the model is deployed.
  • Implementation: Measure the CPU utilization of the inference service. The setup for inference measurement may be more involved than training measurement because inference is often not run as a standalone process.
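
A minimal sketch using the third-party psutil package, assuming the inference service runs as a separate process whose PID is known; the sampling duration and interval are arbitrary choices.

import psutil

def sample_cpu_utilization(pid, samples=60, interval=1.0):
    """Average and peak CPU utilization (%) of a process over a sampling window."""
    proc = psutil.Process(pid)
    readings = [proc.cpu_percent(interval=interval) for _ in range(samples)]
    return sum(readings) / len(readings), max(readings)

# average, peak = sample_cpu_utilization(inference_service_pid)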

Inference Memory Consumption

  • Objective: Measure the peak and average memory consumption during model inference.
  • Metric: The volume of memory consumed in bytes or some multiple thereof (kilobytes, megabytes, etc.). This metric is absolute.
  • Rationale: The memory requirements of model inference determine the load that it places on the system during inference. This is a key determinant in the memory resources required for model deployment. For example, a model for which inference is not memory-intensive may be deployed to an instance with relatively light memory resources. This might allow for investment in other resources, such as core count, for the instance to which the model is deployed.
  • Implementation: Measure the memory consumption of the process during the inference procedure.
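
A companion sketch to the CPU example above, again using psutil and assuming a known PID; it samples resident set size and reports the average and peak in bytes.

import time

import psutil

def sample_memory_consumption(pid, samples=60, interval=1.0):
    """Average and peak resident set size (bytes) of a process over a sampling window."""
    proc = psutil.Process(pid)
    readings = []
    for _ in range(samples):
        readings.append(proc.memory_info().rss)
        time.sleep(interval)
    return sum(readings) / len(readings), max(readings)

# average_bytes, peak_bytes = sample_memory_consumption(inference_service_pid)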

Inference Energy Consumption

  • Objective: Measure the energy consumption of the model inference process.
  • Metric: The energy consumed by the inference process in joules (total power consumption over a time interval).
  • Rationale: For large-scale machine learning applications, energy consumption may be a major driver in the total cost of development and maintenance.
  • Implementation: Energy consumption and power requirements are a relatively new consideration in the field of machine learning. Accordingly, methods for convenient and accurate measurement are limited.

Process Resources

MLTE was created based on existing techniques and cutting-edge research for machine learning. This section explains why the team made the choices it did for the MLTE framework and infrastructure.

Baseline and Performance Metric Selection

Information on Baseline Selection

  • Some datasets and methods already have an accepted baseline that can be used (for instance, PASCAL VOC is an object category recognition and detection benchmark).
  • Classify everything as the majority class (as described in Chapter 7.2 of Hvitfeldt & Silge 2021); see the sketch after this list.
  • If the model implements a task that is currently performed manually, conduct a test in which humans perform the task and use the human performance as the baseline.
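
A minimal sketch of the majority-class baseline using scikit-learn's DummyClassifier; the synthetic, imbalanced dataset is illustrative only and should be replaced by the project's own train/test split.

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative, imbalanced data; substitute the project's own split.
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Predict the most frequent training label for every input.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print(f"Majority-class baseline accuracy: {accuracy_score(y_test, baseline.predict(X_test)):.2%}")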

Information on Performance Metric Selection

The choice of metric depends on the exact nature of the system being created; the following are some examples to consider from commonly used disciplines of ML. A short sketch of computing the classification metrics follows the list.

  • Classification:
    • Receiver Operating Characteristics (ROC) curves and the Area Under the Curve (AUC): Evaluation metrics for standard classification tasks.
    • Precision Recall Curves and Area Under the Precision Recall Curve (AUPRC): Used when there are class imbalances.
  • Object Detection:
    • Average Precision (AP) is the weighted mean of precisions achieved at each recall threshold.
    • mAP50: Used when detecting multiple classes. mAP50 is the precision accumulated over different levels of recall under the intersection over union (IOU) threshold of 0.50 (commonly used on the PASCAL VOC benchmark).
    • mAP: Extension of mAP50 that is averaged over ten IOU thresholds (0.50 to 0.95 in steps of 0.05), and is commonly used on the Microsoft Common Objects in Context (MS COCO) benchmark.
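
The classification metrics above can be computed with scikit-learn, as in the minimal sketch below with toy scores; object-detection metrics such as AP and mAP are typically computed with the benchmark's own evaluation tooling (e.g., the COCO evaluation utilities) and are not reproduced here.

from sklearn.metrics import average_precision_score, roc_auc_score

# y_true: binary ground-truth labels; y_score: predicted scores for the positive class.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

print(f"AUC:   {roc_auc_score(y_true, y_score):.3f}")
print(f"AUPRC: {average_precision_score(y_true, y_score):.3f}")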

Resources on Machine Learning Pipelines and Processes

ML Training Best Practices

  • Ensuring that representative training and test data is available or provided for the problem at hand, and handling the data appropriately based on any associated permissions or authorities that are required.
  • Splitting the data correctly for training, validation, and testing.
  • Appropriately selecting a model type and then fine-tuning it.

Ch 2 End-to-End Machine Learning from Hands-On Machine Learning by Aurélien Géron

  1. Look at the big picture.
  2. Get the data.
  3. Explore and visualize the data to gain insights.
  4. Prepare the data for Machine Learning algorithms.
  5. Select a model and train it.
  6. Fine-tune your model.
  7. Present your solution.
  8. Launch, monitor, and maintain your system.

A Course in Machine Learning by Hal Daumé III

Introduction to Machine Learning with Python by Andreas C. Müller and Sarah Guido

  1. Ask if machine learning can solve the problem at hand.
  2. Find and obtain relevant data for the problem.
  3. Examine and understand the data.
  4. Build a model.
  5. Make predictions.
  6. Evaluate your model.

More ML-process related topics:

Generating Multiple Test Sets

If it is possible for multiple holdout test sets to be generated, using different ones for each evaluation in IMT and SDMT will produce the best results.

However, it is often not possible for practitioners to generate multiple test sets, and there is research to support that substantial overfitting does not occur even if a single test set is used multiple times (Roelofs et al. 2019).

To differentiate between evaluations, we recommend maintaining good version control for models, as they are the unit by which MLTE tracks evaluations.

Model Property Definition

Research on model property definition.

Requirements Selection

Research on requirements selection.