In a previous post, we covered how to use Wav2Vec2.0 in Python and export it to the ONNX format. Now, let’s try to use this model in Unity. We will use Unity Sentis 2.1, as it is the official way to run ONNX models as of today.
If we were to stick to Python, we could run a program in parallel with the Unity code as we did in Speech-to-world, but this time let’s use the model directly in Unity.
Using ONNX in Unity
Let’s ship our ONNX file to Unity. You need to install Unity Sentis first, and then you will be able to import the ONNX file as any other asset.
As input, we can simply provide the audio samples, and we get the logits as output.
Let’s use the following (partial) code:
using Unity.Sentis;
using UnityEngine;

public class Wav2Vec2Manager : MonoBehaviour
{
    [SerializeField, Tooltip("Wav2Vec2 model")]
    private ModelAsset modelAsset;

    [SerializeField, Tooltip("Audio clip to use, 16 kHz only!")]
    private AudioClip audioClip;

    /// <summary>
    /// Convert CPU tensor of size (W, H) to a nested array.
    /// </summary>
    /// <param name="tensor">Input tensor.</param>
    /// <returns>Nested array of the same size as the input.</returns>
    static float[][] TensorToArray(Tensor<float> tensor)
    {
        var rawLogits = tensor.DownloadToArray();
        var logits = new float[tensor.shape[0]][];
        for (int i = 0; i < tensor.shape[0]; i++)
        {
            logits[i] = rawLogits[(i * tensor.shape[1])..((i + 1) * tensor.shape[1])];
        }
        return logits;
    }

    // Start is called before the first frame update
    void Start()
    {
        Model model = ModelLoader.Load(modelAsset);

        // Read the raw 16 kHz samples from the clip
        float[] audioArray = new float[audioClip.samples];
        audioClip.GetData(audioArray, 0);
        Tensor<float> inputTensor = new(new TensorShape(1, audioClip.samples), audioArray);

        // Start computing
        Worker worker = new(model, BackendType.CPU);
        worker.Schedule(inputTensor);

        // Get the result (as logits) and drop the batch dimension
        var outputTensor = worker.PeekOutput() as Tensor<float>;
        outputTensor.Reshape(new TensorShape(outputTensor.shape[1], outputTensor.shape[2]));
        var cpuTensor = outputTensor.ReadbackAndClone();
        Debug.Log("Computation done!");
        var logits = TensorToArray(cpuTensor);
        // ... continue from here
    }
}
With the previous code, cpuTensor holds our model output, and we can read its values with cpuTensor.DownloadToArray(). TensorToArray is simply a way to convert the W × H output into a nested array of shape [W][H].
From logits to phonemes
Let’s have a look at our logits tensor. It is a matrix with one row per 20 ms time step and one column per phoneme; a higher value means a higher confidence that the corresponding phoneme is being pronounced during that time step.
If you use a simple word, we get the following matrix:

[Figure: logits matrix for a simple spoken word]
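Since the logits are unnormalized scores, it can help to squash one row (one 20 ms time step) into probabilities when inspecting this matrix. A minimal sketch in plain C# (Softmax is a hypothetical helper, not a Sentis API):

/// <summary>
/// Softmax over one time step's logits, handy for inspecting confidence levels.
/// </summary>
static float[] Softmax(float[] row)
{
    // Subtract the row maximum before exponentiating, for numerical stability
    float max = float.MinValue;
    foreach (var v in row) if (v > max) max = v;

    var exps = new float[row.Length];
    float sum = 0f;
    for (int i = 0; i < row.Length; i++)
    {
        exps[i] = Mathf.Exp(row[i] - max);
        sum += exps[i];
    }
    // Normalize so the row sums to 1
    for (int i = 0; i < exps.Length; i++) exps[i] /= sum;
    return exps;
}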
To convert the logits into predicted phonemes, let’s create an Arg Max function inside Unity. We can either compute it from a nested array, or use the ArgMax method of the Functional class.
The first approach involves converting the tensor to a nested array (with TensorToArray), then running a classical Arg Max on each row. We can use this code:
/// <summary>
/// Compute Arg Max from a nested array.
/// </summary>
/// <param name="float">Input tensor, float[width][height]</param>
/// <returns>Arg Max of the output, int[witdh].</returns>
static int[] GetArgMax(float[][] floats)
{
int[] output = new int[floats.Length];
float maxVal;
int rowIndex;
for (int i = 0; i < output.Length; i++)
{
maxVal = float.MinValue;
for (int j = 0; j < floats[i].Length; j++)
{
if (floats[i][j] > maxVal)
{
maxVal = floats[i][j];
rowIndex = j;
output[i] = rowIndex;
}
}
}
return output;
}Alternatively, you may want to work directly on tensors. The main issue is as of Sentis 2.1.0, you cannot use the Functional class without creating a model. For the sake of this demonstration, we will create a model each time the method is called, but it is not optimal.
/// <summary>
/// Compute Arg Max directly from a tensor, by creating a new model.
/// </summary>
/// <param name="tensor">Input tensor, float[width][height].</param>
/// <returns>Arg Max of each row, int[width].</returns>
static int[] AiArgMax(Tensor<float> tensor)
{
    // Create the model
    FunctionalGraph graph = new();
    var input = graph.AddInput<float>(tensor.shape);
    var argMax = Functional.ArgMax(input, dim: 1);
    var argMaxModel = graph.Compile(argMax);

    // Run the model
    Worker worker = new(argMaxModel, BackendType.CPU);
    worker.Schedule(tensor);
    var outputTensor = worker.PeekOutput() as Tensor<int>;
    var cpuTensor = outputTensor.ReadbackAndClone();

    // Clean up and get the output
    var outputArray = cpuTensor.DownloadToArray();
    outputTensor.Dispose();
    cpuTensor.Dispose();
    worker.Dispose();
    return outputArray;
}

We have now selected our best phoneme candidate ID with either GetArgMax or AiArgMax (if you call AiArgMax repeatedly, see the reuse sketch below). We can proceed to convert this ID to the corresponding phoneme as a string.
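Before moving on, here is that reuse sketch: instead of compiling a model on every call, you can compile the Arg Max graph once and keep the Worker around. This assumes the logits shape stays the same between calls; argMaxWorker, InitArgMax, and RunArgMax are hypothetical names, not part of Sentis.

// Compiled once (e.g. in Start) and reused afterwards
Worker argMaxWorker;

void InitArgMax(TensorShape logitsShape)
{
    // Same graph as in AiArgMax, but built a single time
    FunctionalGraph graph = new();
    var input = graph.AddInput<float>(logitsShape);
    var argMax = Functional.ArgMax(input, dim: 1);
    argMaxWorker = new Worker(graph.Compile(argMax), BackendType.CPU);
}

int[] RunArgMax(Tensor<float> logits)
{
    // Reuse the existing worker; only the input tensor changes
    argMaxWorker.Schedule(logits);
    var output = argMaxWorker.PeekOutput() as Tensor<int>;
    using var cpuOutput = output.ReadbackAndClone();
    return cpuOutput.DownloadToArray();
}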
To simplify things, let’s map the phoneme IDs through a simple array of strings, known as tokens.
// Tokens from vocab.json
readonly string[] tokens = {
"|", "aɪ", "aʊ", "b", "d", "d͡ʒ", "eɪ", "f", "h", "i", "j",
"k", "l", "m", "n", "oʊ", "p", "s", "t", "t͡ʃ", "u", "v", "w",
"z", "æ", "ð", "ŋ", "ɑ", "ɔ", "ɔɪ", "ə", "ɚ", "ɛ", "ɡ", "ɪ", "ɹ",
"ʃ", "ʊ", "ʌ", "ʒ", "θ", "[UNK]", "[PAD]"
};

Now, we can simply iterate through the Arg Max output and use each ID as an index into tokens. Special “phonemes” such as “|”, “[UNK]” and “[PAD]” can be substituted with “” if you don’t want to display them (see the clean-up sketch below).
var tokenIds = GetArgMax(logits);
string[] phonemes = new string[tokenIds.Length];
for (int i = 0; i < tokenIds.Length; i++)
{
    phonemes[i] = tokens[tokenIds[i]];
}
Debug.Log(string.Join(" ", phonemes));

Now, phonemes is our list of phonemes!
Putting things together
Let’s provide the full script for convenience:
using Unity.Sentis;
using UnityEngine;

public class Wav2Vec2Manager : MonoBehaviour
{
    [SerializeField, Tooltip("Wav2Vec2 model")]
    private ModelAsset modelAsset;

    [SerializeField, Tooltip("Audio clip to use, 16 kHz only!")]
    private AudioClip audioClip;

    /// <summary>
    /// Convert CPU tensor of size (W, H) to a nested array.
    /// </summary>
    /// <param name="tensor">Input tensor.</param>
    /// <returns>Nested array of the same size as the input.</returns>
    static float[][] TensorToArray(Tensor<float> tensor)
    {
        var rawLogits = tensor.DownloadToArray();
        var logits = new float[tensor.shape[0]][];
        for (int i = 0; i < tensor.shape[0]; i++)
        {
            logits[i] = rawLogits[(i * tensor.shape[1])..((i + 1) * tensor.shape[1])];
        }
        return logits;
    }

    /// <summary>
    /// Compute Arg Max from a nested array.
    /// </summary>
    /// <param name="floats">Input values, float[width][height].</param>
    /// <returns>Arg Max of each row, int[width].</returns>
    static int[] GetArgMax(float[][] floats)
    {
        int[] output = new int[floats.Length];
        for (int i = 0; i < floats.Length; i++)
        {
            float maxVal = float.MinValue;
            for (int j = 0; j < floats[i].Length; j++)
            {
                if (floats[i][j] > maxVal)
                {
                    maxVal = floats[i][j];
                    output[i] = j;
                }
            }
        }
        return output;
    }

    /// <summary>
    /// Compute Arg Max directly from a tensor, by creating a new model.
    /// </summary>
    /// <param name="tensor">Input tensor, float[width][height].</param>
    /// <returns>Arg Max of each row, int[width].</returns>
    static int[] AiArgMax(Tensor<float> tensor)
    {
        FunctionalGraph graph = new();
        var input = graph.AddInput<float>(tensor.shape);
        var argMax = Functional.ArgMax(input, dim: 1);
        var argMaxModel = graph.Compile(argMax);

        Worker worker = new(argMaxModel, BackendType.CPU);
        worker.Schedule(tensor);
        var outputTensor = worker.PeekOutput() as Tensor<int>;
        var cpuTensor = outputTensor.ReadbackAndClone();

        var outputArray = cpuTensor.DownloadToArray();
        outputTensor.Dispose();
        cpuTensor.Dispose();
        worker.Dispose();
        return outputArray;
    }

    // Start is called before the first frame update
    void Start()
    {
        Model model = ModelLoader.Load(modelAsset);

        // Read the raw 16 kHz samples from the clip
        float[] audioArray = new float[audioClip.samples];
        audioClip.GetData(audioArray, 0);
        Tensor<float> inputTensor = new(new TensorShape(1, audioClip.samples), audioArray);

        // Start computing
        Worker worker = new(model, BackendType.CPU);
        worker.Schedule(inputTensor);

        // Get the result (as logits) and drop the batch dimension
        var outputTensor = worker.PeekOutput() as Tensor<float>;
        outputTensor.Reshape(new TensorShape(outputTensor.shape[1], outputTensor.shape[2]));
        var cpuTensor = outputTensor.ReadbackAndClone();
        Debug.Log("Computation done!");

        // Convert to logits
        var logits = TensorToArray(cpuTensor);

        // Convert to phonemes (AiArgMax shown as the tensor-based alternative)
        var altTokenIds = AiArgMax(cpuTensor);
        var tokenIds = GetArgMax(logits);
        string[] phonemes = new string[tokenIds.Length];
        for (int i = 0; i < tokenIds.Length; i++)
        {
            phonemes[i] = tokens[tokenIds[i]];
        }
        Debug.Log(string.Join(" ", phonemes));

        // Clean-up
        outputTensor.Dispose();
        cpuTensor.Dispose();
        inputTensor.Dispose();
        worker.Dispose();
    }

    /// <summary>
    /// Tokens from vocab.json
    /// </summary>
    readonly string[] tokens = {
        "|", "aɪ", "aʊ", "b", "d", "d͡ʒ", "eɪ", "f", "h", "i", "j",
        "k", "l", "m", "n", "oʊ", "p", "s", "t", "t͡ʃ", "u", "v", "w",
        "z", "æ", "ð", "ŋ", "ɑ", "ɔ", "ɔɪ", "ə", "ɚ", "ɛ", "ɡ", "ɪ", "ɹ",
        "ʃ", "ʊ", "ʌ", "ʒ", "θ", "[UNK]", "[PAD]"
    };
}
Your model is ready to be used! With an audio file saying “test”, the output should be “t ɛ s t”.
Conclusion
We get a pretty good model, integrated into Unity. Currently, the audio clip must be provided as an asset, but in the next version I want to take input from the microphone. The final goal is to have the model tell us whether a word was correctly pronounced, so stay tuned!