Web Neural Network API

Draft Community Group Report,

This version:
https://webmachinelearning.github.io/webnn/
Issue Tracking:
GitHub
Editor:
Ningxin Hu (Intel Corporation)

Abstract

This document describes a dedicated low-level API for neural network inference hardware acceleration.

Status of this document

This specification was published by the Machine Learning for the Web Community Group. It is not a W3C Standard nor is it on the W3C Standards Track. Please note that under the W3C Community Contributor License Agreement (CLA) there is a limited opt-out and other conditions apply. Learn more about W3C Community and Business Groups.

1. Introduction

Introduction here.

2. Use cases

2.1. Application Use Cases

This section illustrates application-level use cases for neural network inference hardware acceleration. All applications in those use cases can be built on top of pre-trained deep neural network (DNN) models.

2.1.1. Person Detection

A user opens a web-based video conferencing application, but temporarily leaves her room. The application watches whether she is in front of her PC by using object detection (for example, approaches such as [SSD] or [YOLO] that use a single DNN) to detect regions of a camera input frame that contain persons.

When she comes back, the application automatically detects her and notifies other online users that she is active now.

2.1.2. Semantic Segmentation

A user joins a teleconference via a web-based video conferencing application at her desk since no meeting room in her office is available. During the teleconference, she does not want her room or the people in the background to be visible. To protect the privacy of the other people and her surroundings, the application runs a machine learning model such as [DeepLabv3+] or [MaskR-CNN] to semantically split each frame into segments and replaces the segments that represent other people and the background with another picture.

2.1.3. Skeleton Detection

A web-based video conferencing application tracks the pose of the user’s skeleton by running a machine learning model that allows real-time human pose estimation, such as [PoseNet], to recognize her gestures and body language. When she raises her hand, her microphone is automatically unmuted and she can start speaking in the teleconference.

2.1.4. Face Recognition

There are multiple people in the conference room and they join an online meeting using a web-based video conferencing application. The application detects the faces of participants by using object detection (for example, approaches such as [SSD]) and checks whether each face was present at the previous meeting by running a machine learning model such as [FaceNet], which verifies whether two faces belong to the same person.

2.1.5. Facial Landmark Detection

A user wants to find new glasses that fit her well on an online glasses store. The online store offers a web-based try-on simulator that runs a machine learning model such as the Face Alignment Network [FAN] to detect facial landmarks like the eyes, nose, and mouth. When she chooses a pair of glasses, the simulator renders the selected glasses at the detected position of her eyes on her facial image.

2.1.6. Style Transfer

A user is looking for cosmetics on an online store and wondering which color would suit her face. The online store shows sample facial makeup images of its cosmetics and offers a makeup simulator that runs a machine learning model like [ContextualLoss] or [PairedCycleGAN] to transfer the makeup style of the sample image to her facial image. With the simulator she can check how the selected makeup looks on her face.

2.1.7. Super Resolution

A web-based video conferencing application is receiving a video stream from its peer, but the resolution of the video drops due to network congestion. To prevent degradation of the perceived video quality, the application runs a machine learning model for super-resolution, such as [SRGAN], to generate higher-resolution video frames.

2.1.8. Image Captioning

For better accessibility, a web-based presentation application provides automatic image captioning by running a machine learning model such as [im2txt], which predicts explanatory captions for the presentation slides.

2.1.9. Machine Translation

Multiple people from various countries are talking via a web-based real-time text chat application. The application translates their conversation by using a machine learning model such as [GNMT] or [OpenNMT], which translates each text into a different language.

2.1.10. Emotion Analysis

A user is talking to her friend via a web-based real-time text chat application, and she is wondering how the friend feels because she cannot see the friend’s face. The application analyses the friend’s emotion by using a machine learning model such as [DeepMoji], which infers emotion from input texts, and displays an emoji that represents the estimated emotion.

2.1.11. Video Summarization

A web-based video conferencing application records received video streams, and it needs to reduce the amount of recorded video data to be stored. The application generates a short version of the recorded video by using a machine learning model for video summarization, such as [Video-Summarization-with-LSTM].

2.2. Framework Use Cases

This section collects framework-level use cases for a dedicated low-level API for neural network inference hardware acceleration. It is expected that Machine Learning frameworks will be key consumers of the Web Neural Network API (WebNN API) and that the low-level details exposed through the WebNN API will be abstracted away from typical web developers. However, it is also expected that web developers with a specific interest and competence in Machine Learning will want to interface with the WebNN API directly instead of through a higher-level ML framework.

2.2.1. Custom Layer

A web application developer wants to run a DNN model on the WebNN API. However, she has found that some activation functions, such as [LeakyReLU] and [ELU], are not included in the WebNN API. To address this issue, she constructs custom layers for the additional activation functions on top of the WebNN API. Note that the scope of custom layers may include convolution, normalization, etc., as well as activation.
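
For example, a LeakyReLU activation, defined as max(x, alpha * x), could be composed from the MUL and MAXIMUM operations listed in § 3.3. The following non-normative sketch assumes the variables from the example in § 4 (nn, model, operandIndex, float32TensorType, TENSOR_SIZE, and fusedActivationFuncNone), an existing operand index x to apply the activation to, and that MAXIMUM takes exactly two tensor inputs.

// Sketch only: compose LeakyReLU(x) = max(x, alpha * x) from existing operations.
// x is the index of the operand the activation is applied to (assumed to exist).

// alpha is a constant tensor holding the slope used for negative values.
const alpha = operandIndex++;
model.addOperand(float32TensorType);
model.setOperandValue(alpha, new Float32Array(TENSOR_SIZE).fill(0.01));

// scaled = alpha * x. As in § 4, MUL takes a fused activation operand.
const scaled = operandIndex++;
model.addOperand(float32TensorType);
model.addOperation(nn.MUL, [x, alpha, fusedActivationFuncNone], [scaled]);

// leaky = max(x, scaled). Assumes MAXIMUM takes two tensor inputs.
const leaky = operandIndex++;
model.addOperand(float32TensorType);
model.addOperation(nn.MAXIMUM, [x, scaled], [leaky]);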

2.2.2. Network Concatenation

A web application uses a DNN model whose data for the upper convolutional layers and the lower fully-connected layers are stored in separate files, because the data of the fully-connected layers are periodically updated by fine tuning on the server side.

Therefore, the application first downloads both partial model files and concatenates them into a single model. When the model is updated, the application downloads the fine-tuned part of the model and replaces only the fully-connected layers with it.
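
A minimal sketch of this flow follows; the file names, dimensions, and operand indices are hypothetical, and the sketch only shows the constant operands of both parts being added to the same Model.

// Sketch only: download the two partial model files.
const [convBuffer, fcBuffer] = await Promise.all([
  fetch('model-conv.bin').then(r => r.arrayBuffer()),  // convolutional layers (stable)
  fetch('model-fc.bin').then(r => r.arrayBuffer())     // fully-connected layers (fine-tuned)
]);

const nn = navigator.ml.getNeuralNetworkContext();
const model = await nn.createModel();

// Constant weights of the convolutional part come from convBuffer and those of
// the fully-connected part from fcBuffer, so the two files are concatenated
// into a single graph.
const convWeights = 0;
model.addOperand({type: nn.TENSOR_FLOAT32, dimensions: [3, 3, 3, 8]});
model.setOperandValue(convWeights, new Float32Array(convBuffer));

const fcWeights = 1;
model.addOperand({type: nn.TENSOR_FLOAT32, dimensions: [10, 128]});
model.setOperandValue(fcWeights, new Float32Array(fcBuffer));

// ... remaining operands, operations, identifyInputsAndOutputs(), and finish() ...
// When the server publishes a fine-tuned update, only model-fc.bin is
// re-downloaded and the fully-connected operand values are replaced.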

2.2.3. Performance Adaptation

A web application developer is concerned about the performance of her DNN model on mobile devices. She has confirmed that it may run too slowly on mobile devices that do not have GPU acceleration. To address this issue, her web application queries the WebNN API to determine whether acceleration is available, so that it can display a warning on devices without acceleration.

After several weeks, she has developed a tiny DNN model that can run even on a CPU. To accommodate CPU execution, she modifies the application so that it loads the tiny model on CPU-only devices.
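
A sketch of the adaptation logic follows. Note that the API in § 3 does not define an acceleration query; deviceHasAcceleration() and showSlowDeviceWarning() below stand in for application-specific code, and the file names are hypothetical.

// Sketch only: deviceHasAcceleration() is an application-supplied check,
// not part of this API.
const accelerated = deviceHasAcceleration();
if (!accelerated) {
  // Warn the user that inference may be slow on this device.
  showSlowDeviceWarning();
}
// Load the tiny CPU-friendly model on devices without acceleration,
// and the full model otherwise.
const weightsUrl = accelerated ? 'full-model.bin' : 'tiny-model.bin';
const weights = await (await fetch(weightsUrl)).arrayBuffer();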

3. API

3.1. Navigator

partial interface Navigator {
  readonly attribute ML ml;
};

3.2. ML

interface ML {
  NeuralNetworkContext getNeuralNetworkContext();
};

3.3. NeuralNetworkContext

interface NeuralNetworkContext {
  // Operand types.
  const long FLOAT32 = 0;
  const long INT32 = 1;
  const long UINT32 = 2;
  const long TENSOR_FLOAT32 = 3;
  const long TENSOR_INT32 = 4;
  const long TENSOR_QUANT8_ASYMM = 5;

  // Operation types.
  const long ADD = 0;
  const long AVERAGE_POOL_2D = 1;
  const long CONCATENATION = 2;
  const long CONV_2D = 3;
  const long DEPTHWISE_CONV_2D = 4;
  const long DEPTH_TO_SPACE = 5;
  const long DEQUANTIZE = 6;
  const long EMBEDDING_LOOKUP = 7;
  const long FLOOR = 8;
  const long FULLY_CONNECTED = 9;
  const long HASHTABLE_LOOKUP = 10;
  const long L2_NORMALIZATION = 11;
  const long L2_POOL_2D = 12;
  const long LOCAL_RESPONSE_NORMALIZATION = 13;
  const long LOGISTIC = 14;
  const long LSH_PROJECTION = 15;
  const long LSTM = 16;
  const long MAX_POOL_2D = 17;
  const long MUL = 18;
  const long RELU = 19;
  const long RELU1 = 20;
  const long RELU6 = 21;
  const long RESHAPE = 22;
  const long RESIZE_BILINEAR = 23;
  const long RNN = 24;
  const long SOFTMAX = 25;
  const long SPACE_TO_DEPTH = 26;
  const long SVDF = 27;
  const long TANH = 28;
  const long BATCH_TO_SPACE_ND = 29;
  const long TRANSPOSE = 37;
  const long MAXIMUM = 65;

  // Fused activation function types.
  const long FUSED_NONE = 0;
  const long FUSED_RELU = 1;
  const long FUSED_RELU1 = 2;
  const long FUSED_RELU6 = 3;

  // Implicit padding algorithms.
  const long PADDING_SAME = 1;
  const long PADDING_VALID = 2;

  // Compilation preferences.
  const long PREFER_LOW_POWER = 0;
  const long PREFER_FAST_SINGLE_ANSWER = 1;
  const long PREFER_SUSTAINED_SPEED = 2;

  Promise<Model> createModel();
};

3.4. OperandOptions

dictionary OperandOptions {
  // The operand type.
  required long type;

  // The dimensions field is only required for tensor operands.
  sequence<unsigned long> dimensions;

  // The following two fields are only required for quantized operands.
  // scale: a non-negative floating-point value
  // zeroPoint: an integer in the range [0, 255]
  // The real value is (value - zeroPoint) * scale
  float scale;
  long zeroPoint;
};
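
For example, a quantized 8-bit tensor operand could be described as follows; the dimensions, scale, and zeroPoint values are illustrative only, and nn and model are set up as in § 4.

// Illustrative only: an OperandOptions for a quantized 8-bit tensor.
const quant8TensorType = {
  type: nn.TENSOR_QUANT8_ASYMM,
  dimensions: [1, 224, 224, 3],
  scale: 0.0078125,  // non-negative scale factor
  zeroPoint: 128     // integer in the range [0, 255]
};
model.addOperand(quant8TensorType);
// A stored 8-bit value v represents the real value (v - 128) * 0.0078125,
// so v = 128 maps to 0.0 and v = 255 maps to 0.9921875.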

3.5. Model

interface Model {
  void addOperand(OperandOptions options);
  void setOperandValue(unsigned long index, ArrayBufferView data);
  void addOperation(long type, sequence<unsigned long> inputs,
                    sequence<unsigned long> outputs);
  void identifyInputsAndOutputs(sequence<unsigned long> inputs,
                                sequence<unsigned long> outputs);
  Promise<void> finish();
  Promise<Compilation> createCompilation();
};

3.6. Compilation

interface Compilation {
  void setPreference(long preference);
  Promise<void> finish();
  Promise<Execution> createExecution();
};

3.7. Execution

interface Execution {
  void setInput(unsigned long index, ArrayBufferView data);
  void setOutput(unsigned long index, ArrayBufferView data);
  Promise<void> startCompute();
};

4. Examples

The following code gets the NeuralNetworkContext object.
const nn = navigator.ml.getNeuralNetworkContext();
The following code builds a graph as:
tensor0 ---+
           +--- ADD ---> intermediateOutput0 ---+
tensor1 ---+                                    |
                                                +--- MUL---> output
tensor2 ---+                                    |
           +--- ADD ---> intermediateOutput1 ---+
tensor3 ---+

The tensor0 and tensor2 are constants. The tensor1 and tensor3 are user inputs.
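
The constant values of tensor0 and tensor2 are read from an arrayBuffer that is assumed to have been loaded earlier; a minimal way to obtain it (the file name is hypothetical) could be:

// Sketch only: fetch the constant tensor data referenced below as arrayBuffer.
const response = await fetch('constants.bin');
const arrayBuffer = await response.arrayBuffer();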

// Use tensors in 4 dimensions.
const TENSOR_DIMS = [2, 2, 2, 2];
const TENSOR_SIZE = 16;

// Track operand index.
let operandIndex = 0;

// Create a Model object.
const model = await nn.createModel();

// Create OperandOptions objects.
const float32TensorType = {type: nn.TENSOR_FLOAT32, dimensions: TENSOR_DIMS};
const scalarInt32Type = {type: nn.INT32};

// Add the operand for the NONE activation function and set its value to
// FUSED_NONE.
const fusedActivationFuncNone = operandIndex++;
model.addOperand(scalarInt32Type);
model.setOperandValue(fusedActivationFuncNone, new Int32Array([nn.FUSED_NONE]));

// tensor0 is a constant tensor. Set its value from an ArrayBuffer object.
// The ArrayBuffer object may contain the trained weight data loaded beforehand.
const tensor0 = operandIndex++;
model.addOperand(float32TensorType);
model.setOperandValue(tensor0, new Float32Array(arrayBuffer, 0, TENSOR_SIZE));

// tensor1 is one of the input tensors. Its value will be set before execution.
const tensor1 = operandIndex++;
model.addOperand(float32TensorType);

// tensor2 is another constant tensor. Set its value from the same ArrayBuffer
// object, at an offset.
const tensor2 = operandIndex++;
model.addOperand(float32TensorType);
model.setOperandValue(tensor2,
                      new Float32Array(arrayBuffer,
                                       TENSOR_SIZE * Float32Array.BYTES_PER_ELEMENT,
                                       TENSOR_SIZE));

// tensor3 is another input tensor. Its value will be set before execution.
const tensor3 = operandIndex++;
model.addOperand(float32TensorType);

// intermediateOutput0 is the output of the first ADD operation.
const intermediateOutput0 = operandIndex++;
model.addOperand(float32TensorType);

// intermediateOutput1 is the output of the second ADD operation.
const intermediateOutput1 = operandIndex++;
model.addOperand(float32TensorType);

// output is the output tensor of the MUL operation.
const output = operandIndex++;
model.addOperand(float32TensorType);

// Add the first ADD operation.
model.addOperation(nn.ADD, [tensor0, tensor1, fusedActivationFuncNone],
                   [intermediateOutput0]);

// Add the second ADD operation.
model.addOperation(nn.ADD, [tensor2, tensor3, fusedActivationFuncNone],
                   [intermediateOutput1]);

// Add the MUL operation. The intermediateOutput0 and intermediateOutput1 are
// inputs to the MUL operation.
model.addOperation(
    nn.MUL,
    [intermediateOutput0, intermediateOutput1, fusedActivationFuncNone],
    [output]);

// Identify the input and output tensors of the model.
model.identifyInputsAndOutputs([tensor1, tensor3], [output]);

// Finish building the model.
await model.finish();
The following code compiles the graph.
// Create a Compilation object for the constructed model.
const compilation = await model.createCompilation();

// Set the preference for the Compilation object.
compilation.setPreference(nn.PREFER_FAST_SINGLE_ANSWER);

// Finish the compilation.
await compilation.finish();
The following code executes the compiled graph.
// Create an Execution object for the compiled model.
const execution = await compilation.createExecution();

// Set up the input tensors that contain the input data provided by the user.
const inputTensor1 = new Float32Array(TENSOR_SIZE);
inputTensor1.fill(inputValue1);
const inputTensor2 = new Float32Array(TENSOR_SIZE);
inputTensor2.fill(inputValue2);

// Associate the input tensors with the model’s inputs.
execution.setInput(0, inputTensor1);
execution.setInput(1, inputTensor2);

// Associate the output tensor with the model’s output.
let outputTensor = new Float32Array(TENSOR_SIZE);
execution.setOutput(0, outputTensor);

// Start the asynchronous computation.
await execution.startCompute();
// The computed result is now in outputTensor.

5. Acknowledgements

This specification follows the concepts of the Android Neural Networks API C API.

Thanks to Tomoyuki Shimizu, Ningxin Hu, and Zhiqiang Yu for the use cases.

Conformance

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example", like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:

Note, this is an informative note.

References

Normative References

[HTML]
Anne van Kesteren; et al. HTML Standard. Living Standard. URL: https://html.spec.whatwg.org/multipage/
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://tools.ietf.org/html/rfc2119
[WebIDL]
Boris Zbarsky. Web IDL. 15 December 2016. ED. URL: https://heycam.github.io/webidl/

Informative References

[ContextualLoss]
Roey Mechrez; Itamar Talmi; Lihi Zelnik-Manor. The Contextual Loss for Image Transformation with Non-Aligned Data. July 2018. URL: https://arxiv.org/abs/1803.02077
[DeepLabv3+]
Liang-Chieh Chen; et al. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. August 2018. URL: https://arxiv.org/abs/1802.02611
[DeepMoji]
Bjarke Felbo; et al. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. October 2017. URL: https://arxiv.org/abs/1708.00524
[ELU]
Djork-Arné Clevert; Thomas Unterthiner; Sepp Hochreiter. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). February 2016. URL: https://arxiv.org/abs/1511.07289
[FaceNet]
Florian Schroff; Dmitry Kalenichenko; James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. June 2015. URL: https://arxiv.org/abs/1503.03832
[FAN]
Adrian Bulat; Georgios Tzimiropoulos. How far are we from solving the 2D & 3D Face Alignment problem? (and a dataset of 230,000 3D facial landmarks). September 2017. URL: https://arxiv.org/abs/1703.07332
[GNMT]
Minh-Thang Luong; Eugene Brevdo; Rui Zhao. Neural Machine Translation (seq2seq) Tutorial. May 2017. URL: https://github.com/tensorflow/nmt
[IM2TXT]
Oriol Vinyals; et al. Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge. September 2016. URL: https://arxiv.org/abs/1609.06647
[LeakyReLU]
Andrew L. Maas; Awni Y. Hannun; Andrew Y. Ng. Rectifier Nonlinearities Improve Neural Network Acoustic Models. June 2013. URL: https://pdfs.semanticscholar.org/367f/2c63a6f6a10b3b64b8729d601e69337ee3cc.pdf
[MaskR-CNN]
Kaiming He; et al. Mask R-CNN. January 2018. URL: https://arxiv.org/abs/1703.06870
[OpenNMT]
Guillaume Klein; et al. OpenNMT: Open-Source Toolkit for Neural Machine Translation. March 2017. URL: https://arxiv.org/abs/1701.02810
[PairedCycleGAN]
Huiwen Chang; et al. PairedCycleGAN: Asymmetric Style Transfer for Applying and Removing Makeup. June 2018. URL: http://openaccess.thecvf.com/content_cvpr_2018/html/Chang_PairedCycleGAN_Asymmetric_Style_CVPR_2018_paper.html
[PoseNet]
Dan Oved. Real-time Human Pose Estimation in the Browser with TensorFlow.js. May 2018. URL: https://medium.com/tensorflow/real-time-human-pose-estimation-in-the-browser-with-tensorflow-js-7dd0bc881cd5
[SRGAN]
Christian Ledig; et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. May 2017. URL: https://arxiv.org/abs/1609.04802
[SSD]
Wei Liu; et al. SSD: Single Shot MultiBox Detector. December 2016. URL: https://arxiv.org/abs/1512.02325
[Video-Summarization-with-LSTM]
Ke Zhang; et al. Video summarization with long short-term memory. October 2016. URL: http://www-scf.usc.edu/~zhan355/ke_eccv2016.pdf
[YOLO]
Joseph Redmon; et al. You Only Look Once: Unified, Real-Time Object Detection. May 2016. URL: https://arxiv.org/abs/1506.02640

IDL Index

partial interface Navigator {
  readonly attribute ML ml;
};

interface ML {
  NeuralNetworkContext getNeuralNetworkContext();
};

interface NeuralNetworkContext {
  // Operand types.
  const long FLOAT32 = 0;
  const long INT32 = 1;
  const long UINT32 = 2;
  const long TENSOR_FLOAT32 = 3;
  const long TENSOR_INT32 = 4;
  const long TENSOR_QUANT8_ASYMM = 5;

  // Operation types.
  const long ADD = 0;
  const long AVERAGE_POOL_2D = 1;
  const long CONCATENATION = 2;
  const long CONV_2D = 3;
  const long DEPTHWISE_CONV_2D = 4;
  const long DEPTH_TO_SPACE = 5;
  const long DEQUANTIZE = 6;
  const long EMBEDDING_LOOKUP = 7;
  const long FLOOR = 8;
  const long FULLY_CONNECTED = 9;
  const long HASHTABLE_LOOKUP = 10;
  const long L2_NORMALIZATION = 11;
  const long L2_POOL_2D = 12;
  const long LOCAL_RESPONSE_NORMALIZATION = 13;
  const long LOGISTIC = 14;
  const long LSH_PROJECTION = 15;
  const long LSTM = 16;
  const long MAX_POOL_2D = 17;
  const long MUL = 18;
  const long RELU = 19;
  const long RELU1 = 20;
  const long RELU6 = 21;
  const long RESHAPE = 22;
  const long RESIZE_BILINEAR = 23;
  const long RNN = 24;
  const long SOFTMAX = 25;
  const long SPACE_TO_DEPTH = 26;
  const long SVDF = 27;
  const long TANH = 28;
  const long BATCH_TO_SPACE_ND = 29;
  const long TRANSPOSE = 37;
  const long MAXIMUM = 65;

  // Fused activation function types.
  const long FUSED_NONE = 0;
  const long FUSED_RELU = 1;
  const long FUSED_RELU1 = 2;
  const long FUSED_RELU6 = 3;

  // Implicit padding algorithms.
  const long PADDING_SAME = 1;
  const long PADDING_VALID = 2;

  // Compilation preferences.
  const long PREFER_LOW_POWER = 0;
  const long PREFER_FAST_SINGLE_ANSWER = 1;
  const long PREFER_SUSTAINED_SPEED = 2;

  Promise<Model> createModel();
};

dictionary OperandOptions {
  // The operand type.
  required long type;

  // The dimensions field is only required for tensor operands.
  sequence<unsigned long> dimensions;

  // The following two fields are only required for quantized operands.
  // scale: a non-negative floating-point value
  // zeroPoint: an integer in the range [0, 255]
  // The real value is (value - zeroPoint) * scale
  float scale;
  long zeroPoint;
};

interface Model {
  void addOperand(OperandOptions options);
  void setOperandValue(unsigned long index, ArrayBufferView data);
  void addOperation(long type, sequence<unsigned long> inputs,
                    sequence<unsigned long> outputs);
  void identifyInputsAndOutputs(sequence<unsigned long> inputs,
                                sequence<unsigned long> outputs);
  Promise<void> finish();
  Promise<Compilation> createCompilation();
};

interface Compilation {
  void setPreference(long preference);
  Promise<void> finish();
  Promise<Execution> createExecution();
};

interface Execution {
  void setInput(unsigned long index, ArrayBufferView data);
  void setOutput(unsigned long index, ArrayBufferView data);
  Promise<void> startCompute();
};