Blog

  • Wav2Vec2.0 in Unity Sentis: export to ONNX

    Wav2Vec2.0 in Unity Sentis: export to ONNX

    Lately I have been digging into Automatic Speech Recognition (ASR) systems and their use in Unity to analyze speech without requiring a high-end graphics card. It drove me back to integrating a Hugging Face model inside Unity, so let’s have a look at the method I used.

    This is the first of a series of two posts; here, we will focus on preparing the model for the outside world.

    Finding an ASR model

    First, we need to choose an ASR model. As of February 2025, Whisper is known to be an excellent generalist model. However, it is resource-intensive and not really good at detecting isolated words. On the other hand, Wav2Vec2.0 is a simpler model, and it gives better results for word-level detection while using fewer resources.

    I actually implemented both, yet for the sake of this post, we will have a look at Wav2Vec2.0. One of the reasons is that Whisper already has some nice integrations (whisper.unity, simpler export to ONNX) as it is more popular.

    We will go with the specific checkpoint (trained model) Wav2Vec2 LJSpeech Gruut. It is a phoneme model: its output won’t be a word such as “hello”, but a string in IPA, such as “hɛlˈoʊ”. This way, it is easier to check whether even a non-word is pronounced correctly.

    Usage in Python

    Our Wav2Vec2Phoneme model is available on Hugging Face, which has most of its libraries in Python, so let’s start with this language.

    We will define our first function to import the model checkpoint:

    import pathlib
    from itertools import groupby
    
    from datasets import load_dataset
    import librosa
    import numpy as np
    import onnx
    import onnxruntime as ort
    import torch
    import transformers
    
    
    def get_model():
        """
        Get the Wav2Vec2.0 model.
    
        The processor will be a Wav2Vec2Processor object.
    
        Source code: https://github.com/huggingface/transformers/tree/main/src/transformers/models/wav2vec2
        Looking at the code, the processor for audio may only be a feature extractor.
    
        :return tuple: Model and processor.
        """
        checkpoint = "bookbot/wav2vec2-ljspeech-gruut"
    
        model = transformers.AutoModelForCTC.from_pretrained(checkpoint)
        processor = transformers.AutoProcessor.from_pretrained(checkpoint)
        return model, processor

    Now let’s define some input audio data.

    def get_audio_data(local_file=None):
        """Read an audio file as an audio array."""
        if local_file is None:
            # load dummy dataset and read soundfiles
            ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
            audio_array = ds[0]["audio"]["array"]
        else:
            # or, read a single audio file
            audio_array, _ = librosa.load(local_file, sr=16000)
        return audio_array
    

    Finally let’s run the model and get a prediction.

    
    def get_logits(audio_array, processor, model, sampling_rate=16000):
        """
        Get the logits for the audio.
    
        For Unity: https://discussions.unity.com/t/how-to-use-logits-output-type/1540746
    
        :param audio_array: Input audio
        :param processor: Processor or feature extractor for the input.
        :param model: Actual model.
        :param sampling_rate: Sampling rate
        :return:
        """
        # In simple cases, this is equivalent to (audio_array - audio_array.mean()) / sqrt(audio_array.var())
        inputs = processor(
            audio_array,
            return_tensors="pt",
            padding=True,
            sampling_rate=sampling_rate
        )
    
        with torch.no_grad():
            logits = model(inputs["input_values"]).logits
        return logits
    
    
    def decode_phonemes(
        ids: torch.Tensor,
        processor: transformers.Wav2Vec2Processor,
        ignore_stress: bool = False
    ) -> str:
        """
        CTC-like decoding.
        First removes consecutive duplicates, then removes special tokens.
        """
        # removes consecutive duplicates
        ids = [id_ for id_, _ in groupby(ids)]
    
        special_token_ids = processor.tokenizer.all_special_ids + [
            processor.tokenizer.word_delimiter_token_id
        ]
        # converts id to token, skipping special tokens
        phonemes = [processor.decode(id_) for id_ in ids if id_ not in special_token_ids]
    
        # joins phonemes
        prediction = " ".join(phonemes)
    
        # whether to ignore IPA stress marks
        if ignore_stress:
            prediction = prediction.replace("ˈ", "").replace("ˌ", "")
    
        return prediction
    
    
    def predicted_answer(logits, processor):
        """From a matrix of logits, get the best prediction match."""
        predicted_ids = torch.argmax(logits, dim=-1)
        my_prediction = decode_phonemes(predicted_ids[0], processor, ignore_stress=True)
        return my_prediction
    
    
    def run_analysis():
        """Run the model."""
        model, processor = get_model()
        sampling_rate = processor.feature_extractor.sampling_rate
        audio_array = get_audio_data()
        logits = get_logits(
            audio_array,
            processor,
            model,
            sampling_rate
        )
        prediction = predicted_answer(logits, processor)
        return prediction
    
    

    The model provider actually wrote a simple Python use case, which is pretty straightforward to use. The idea is to get an audio array from an audio file at 16 kHz, pass it to a processor, and then give it to the model. Let’s try to understand it.

    Input

    First, we pass the audio to a simple Wav2Vec2Processor. From what I tried, it only normalizes the input data, following a Z-score normalization (the audio still has to be provided at 16 kHz). So from our initial audio array, we get an array of the same size, but with a better data range.

    Removing the processor does not seem to change the model outputs, so I guess the model ships with an internal Z-score normalization.
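
    As a minimal sketch of that normalization (assuming the small 1e-7 epsilon that the Hugging Face feature extractor uses to avoid dividing by zero), the processing boils down to:

    import numpy as np


    def normalize_audio(audio_array):
        """Z-score normalization, roughly what the processor does to the raw audio."""
        audio_array = np.asarray(audio_array, dtype=np.float32)
        # Zero mean, unit variance; the small epsilon protects against silent clips
        return (audio_array - audio_array.mean()) / np.sqrt(audio_array.var() + 1e-7)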

    Now, let’s pass the processed audio array as the model input.

    Output

    The model output is in the form of logits, which is simply a matrix of scores (unnormalized probabilities) for each timestamp and each phoneme. In C#, this would be a matrix of size float[n_timestamps][n_possible_phonemes]. This output matrix will be useful later, but for now let’s get the first predicted answer.
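
    A quick way to check that shape (a small sketch; the number of timestamps depends on the audio length, Wav2Vec2.0 producing roughly one frame per 20 ms of audio at 16 kHz):

    model, processor = get_model()
    audio_array = get_audio_data()
    logits = get_logits(audio_array, processor, model)
    print(logits.shape)  # torch.Size([1, n_timestamps, n_possible_phonemes])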

    Decoding the output

    The model output is a matrix of scores, so let’s get phonemes from there.

    We apply an Arg Max function, which gives us the index of the token with the highest prediction score for each timestamp. We now have a list int[n_timestamps] (remember Arg Max selects the index).

    As the matrix was a prediction score for each phoneme, getting the index of the most probable phoneme means we get the ID of the phoneme itself. To see the list of IDs, we can have a look at vocab.json. We can build a simple map int → string from it, and voilà, our output is a string of phonemes.
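
    As a sketch, reusing the logits from get_logits above and assuming vocab.json from the checkpoint has been downloaded next to the script (in practice, processor.decode already wraps this lookup), the mapping could look like:

    import json

    with open("vocab.json", encoding="utf-8") as f:
        vocab = json.load(f)  # maps each phoneme string to its integer ID

    # Reverse the map to go from ID to phoneme
    id_to_phoneme = {token_id: token for token, token_id in vocab.items()}

    predicted_ids = torch.argmax(logits, dim=-1)[0].tolist()
    raw_phonemes = [id_to_phoneme[token_id] for token_id in predicted_ids]

    The consecutive duplicates and special tokens still have to be removed, which is exactly what decode_phonemes above does.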

    Export to ONNX

    Now that we understand our model, let’s use it in Unity. The official, and best, way to integrate AI into Unity is through Unity Sentis. As it only accepts the ONNX file format, we will go with this open-source format. We will only export the model part (not the processor).

    Luckily enough, someone already did it; the only change is to set opset_version to 15 to be compatible with Sentis.

     
    def convert_to_onnx(onnx_model_name="wav2vec2.onnx"):
        """
        Convert a Wav2Vec2 model using a low-level method.
    
        https://github.com/ccoreilly/wav2vec2-service/blob/master/convert_torch_to_onnx.py
    
        :param string onnx_model_name: Name of the output file.
        """
        model, _ = get_model()
        audio_len = 250000
    
        model_input = torch.randn(1, audio_len, requires_grad=True)
    
        torch.onnx.export(
            model,                          # model being run
            model_input,                    # model input (or a tuple for multiple inputs)
            onnx_model_name,                # where to save the model (can be a file or file-like object)
            export_params=True,             # store the trained parameter weights inside the model file
            opset_version=15,               # the ONNX version to export the model to, 15 is recommended version for Unity
            do_constant_folding=True,       # whether to execute constant folding for optimization
            input_names=['input'],          # the model's input names
            output_names=['output'],        # the model's output names
            dynamic_axes={
                'input': {1: 'audio_len'},    # variable length axes
                'output': {1: 'audio_len'}
            }
        )
        print(f"Model saved to {onnx_model_name} successfully!")
    

    For some models, Hugging Face provides Optimum to do the conversion from safetensors to ONNX with a CLI. It was not available for this checkpoint of Wav2Vec2, so I crafted some code.

    If you are lucky enough, running the code will yield a nice .onnx file.

    Checking if everything works: import the ONNX file in Python

    It is a good idea to see if, at this point, our model is still functional. We will import the ONNX file in Python (with ONNX Runtime), run it on some demo data, and try to get the phonemes.

    
    def run_from_onnx(audio_array, model_file="wav2vec2.onnx", check_model=False):
        """
        Run the Wav2Vec2.0 model from an ONNX file.
         
        :param list[float] audio_array: Input data, at 16 kHz.
        :param string model_file: Path to the ONNX file.  
        :param bool check_model: To apply supplementary check to the model file. 
        :return string: Predicted phonemes. 
        """
        audio_array = np.array(audio_array, dtype=np.float32).reshape((1, -1))
        if check_model:
            onnx_model = onnx.load(model_file)
            onnx.checker.check_model(onnx_model)
    
        ort_sess = ort.InferenceSession(model_file)
        logits = ort_sess.run(None, {'input': audio_array})[0]
        
        # The following is to pass from the logits to the prediction, not the hardest part
        _, processor = get_model()
        prediction = predicted_answer(torch.as_tensor(logits), processor)
        return prediction
    

    We can see our answer is the same. Another cool thing with ONNX is that it is easy to visualize; we can use netron.app.
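
    To make sure both pipelines really agree, a minimal sanity check (reusing the helper functions defined above, on the same dummy audio) could be:

    audio_array = get_audio_data()
    assert run_analysis() == run_from_onnx(audio_array), "PyTorch and ONNX predictions differ"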

    Conclusion

    To recap, we exported a Python model that takes audio as input and returns its phoneme representation as output. We used ONNX as an interchange file format. As our final goal is to integrate it within Unity, we will cover that in another post!

  • Rebuilding LWT’s API, part 2

    Hi there! Much work has been done and the new LWT REST API is now ready!

    For the record, the initial suggestion, formatted with the help of ChatGPT, looked like the following (see part 1 for details):

    
    1. GET API Endpoints:
       - Get API Version: `GET /api/version`
       - Get Next Word to Test: `GET /api/test/next-word`
       - Get Tomorrow's Tests Number: `GET /api/test/tomorrow`
       - Get Phonetic Reading: `GET /api/text/phonetic-reading`
       - Get Theming Path: `GET /api/text/theme-path`
       - Get Texts Statistics: `GET /api/text/statistics`
       - Get Media Paths: `GET /api/media/paths`
       - Get Example Sentences: `GET /api/sentences/{word}`
       - Get Imported Terms: `GET /api/terms/imported`
       - Get Similar Terms: `GET /api/terms/{term}/similar`
       - Get Term Translations: `GET /api/terms/{term}/translations`
    
    2. POST API Endpoints:
       - Update Reading Position: `POST /api/reading/position`
       - Add/Update Translation: `POST /api/translation/{word}`
       - Increment/Decrement Term Status: `POST /api/terms/{term}/status`
       - Set Term Status: `POST /api/terms/{term}/status/set`
       - Test Regular Expression: `POST /api/regexp/test`
       - Set Term Annotation: `POST /api/terms/{term}/annotation`
       - Save Setting: `POST /api/settings`

    It featured 11 endpoints on GET and 7 on POST. The key point of this approach was that it was not a fundamental change but rather a reorganization of the already existing AJAX requests. In fact, GPT does not have access to all the app information, as most endpoints are simply a condensed version of my explanation sentences, and were sometimes optimistic or simply wrong: for instance GET /api/text/statistics, how are we supposed to know that we get statistics for a subset of texts here?

    Finally, the chosen implementation is as follows:

    
    1. GET API Endpoints:
       - Get Files Paths in Media folder: `GET /media-files`
       - Get Phonetic Reading: `GET /phonetic-reading`
       - Get Next Word to Review: `GET /review/next-word`
       - Get Tomorrow's Reviews Number: `GET /review/tomorrow-count`
       - Get Sentences containing Any Term: `GET /sentences-with-term`
       - Get Sentences containing Registered Term: `GET /sentences-with-term/{term-id}`
       - Get CSS Theme Path: `GET /settings/theme-path`
       - Get Terms similar to Another One: `GET /similar-terms`
       - Get Term Translations: `GET /terms/{term-id}/translations`
       - Get Imported Terms: `GET /terms/imported`
       - Get Texts Statistics: `GET /texts/statistics`
       - Get API Version: `GET /version`
    
    2. POST API Endpoints:
       - Save Setting: `POST /settings`
       - Decrement Term Status: `POST /terms/{term-id}/status/down`
       - Increment Term Status: `POST /terms/{term-id}/status/up`
       - Set Term Status: `POST /terms/{term-id}/status/{new-status}`
       - Update Term Translation: `POST /terms/{term-id}/translations`
       - Create a New Term With its Translation: `POST /terms/new`
       - Set Text Annotation: `POST /texts/{text-id}/annotation`
       - Update Audio Position: `POST /texts/{text-id}/audio-position`
       - Update Reading Position: `POST /texts/{text-id}/reading-position`
    

    We have 12 endpoints on GET and 9 on POST. Apart from the necessary corrections since GPT’s output, the new system clearly specifies when the server can use previous data or has to build it from scratch.

    A previous “GPT” endpoint was GET /api/sentences/{word}, which would require a word to be passed as text. The server would then search all sentences containing this specific word, parsing and adapting it, which is expensive. When it is a new word, it is still done through GET /sentences-with-term, but for already registered words we use the word ID with GET /sentences-with-term/{term-id}, which naturally translates to a SQL projection.
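
    For illustration, fetching the sentences of an already registered term could look like this from a Python client (the base URL and the shape of the JSON response are assumptions, not the actual LWT format):

    import requests

    BASE_URL = "http://localhost/lwt/api.php/v1"  # hypothetical install path

    term_id = 42  # made-up term ID
    response = requests.get(f"{BASE_URL}/sentences-with-term/{term_id}")
    response.raise_for_status()
    print(response.json())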

    Another feat of this implementation is to remove the need to pass sensitive data such as SQL. When conducting a word review (test), only the server knows which subset of words should be tested, and it shows one of these words to the user. Previously, the subset selection was stored as part of the page URL, and requesting the next word to test would trigger a page refresh. Now, the URL only stores the ID of the language, text or words to test, as well as a unique ID for the test type (language, text or word). In this way, SQL injection is prevented.

    At the general scale, queries are now smaller and faster, as they support caching. The new architecture also allows building tests with NPM, Chai and SuperTest, with 100% code coverage on GET.

    Aftermath

    The API is well integrated into LWT; the requests are smaller, and the app is faster and more secure. The API was released with LWT 2.9.0. Apart from a small fix in 2.9.1, the API has received no further fixes (as of this time), proving its robustness. This RESTful system will now serve as a base to build a more dynamic LWT, paving the way to the future of the app.

    As a first try building a RESTful API, I consider it a great success, and a great initiation to the art. Happy language learning!

  • Rebuilding LWT’s API, part 1

    Rebuilding LWT’s API, part 1

    Hey everyone, for the next release of LWT I have been working on a complete refactor of the AJAX API that will enhance both security and development of future versions of LWT!

    The Issue

    All AJAX files in LWT
    A glance at all the different AJAX files composing LWT before unification.

    As LWT was growing, so were the AJAX calls, without any control. In the end, LWT as of 2.8.0 harbored a collection of PHP files with different formats and standards, making the API difficult to expand and to understand.

    Architecture Design

    The first step to tackle the API proliferation was to understand the needs. To do so, I had to enforce some format standard, and I ended up with a list of all my endpoints:

    It features the following interactions:
      * On GET, ``action_type`` can be:
        * ``version``: the API version and release date.
        * ``test_next_word``: next word to test.
        * ``tomorrow_tests_number``: number of tests for the next day.
        * ``phonetic_reading``: phonetic reading of a text.
        * ``theme_path``: theming path for a file.
        * ``texts_statistics``: various word statistics for each text.
        * ``media_paths``: paths of files and folders in the ``/media`` folder.
        * ``example_sentences``: list of sentences containing a word.
        * ``imported_terms``: list of imported terms through terms upload.
        * ``similar_terms``: similar terms to a given term.
        * ``term_translations``: get the list of term translations to edit its
        annotation.
      * On POST, ``action`` can be:
        * ``reading_position``: ``action_type`` set to ``text`` or ``audio`` changes
        the reading position for a text or its audio.
        * ``change_translation``, with values for ``action_type`` set to:
          * ``add``: add a translation for a new word.
          * ``update``: edit the translation of an existing word.
        * ``term_status``, with values for ``action_type`` set to:
          * ``increment``: increment or decrement the status of a term by one unit.
          * ``set``: set the status of a term.
        * For any other value, set ``action_type`` to:
          * ``regexp``: test if the regular expression is correctly recognized
          (no longer used in the code base?).
          * ``set_annotation``: change the annotation value for a term.
          * ``save_setting``: save a setting.

    On the server side, I moved from a file-by-file approach to a name-by-name query format. It is better, since I can now access all queries from one file, but not perfect. Moreover, all interactions are now in JSON, and I now know which script is accessing what.

    First returns

    Even if not finished, the new API had the following impacts:

    • During term tests, the music can play fully and the page does not reload.
    • Less dubious content is sent, now everything is JSON. The feeds feature was sending raw HTML and JS code, potentially harmful for the user.
    • I have a list of the endpoints.

    Now, the system was good, but I needed something more robust and standard. I opted for the REST standard, which is easy to implement and well documented. I will describe that in a later post, so stay tuned!

  • A Solid Snake

    Let’s talk about a project I have been working on for a long time, and my first publicized Python package: pylinkage. It is a lightweight linkage designer and simulator built for quick code writing, and it was quite successful.

    Animation of a four-bar linkage

    Introduction

    Pylinkage dates back to 2018, when I started working on a project on leg mechanisms. At the time there was no simple solution to generate mechanical linkages in Python, which is why I decided to create my own open-source package. The idea is to simplify the process of creating and manipulating mechanical linkages from linkage definition to mathematical optimization.

    I worked on the package for two years to tune it to my specific needs. After my final examination, I revamped it completely, split it in two packages (pylinkage and leggedsnake) and published it in 2021.

    Not much happened until 2023, when I had some time to work on it again. I made it more standard and fixed several issues.

    For those not familiar with mechanical engineering, a linkage is a mechanism consisting of interconnected bodies called links, joined together by joints or connections. These linkages transform motion between input and output links based on the linkage’s configuration.

    Features

    Pylinkage has revolutionized the design of mechanical linkages in Python. To understand the significance of pylinkage, we have to look at the array of features it provides (a small usage sketch follows the list):

    • Easy linkage creation: it simplifies the process of building mechanical linkages by providing a straightforward API for generating and manipulating linkages in just a few seconds.
    • Several joint configurations: the package supports multiple joint types, which allows creating a wide range of linkages suitable for specific needs.
    • Intuitive visualization: pylinkage provides powerful graphical visualization capabilities, allowing users to effortlessly demonstrate their linkage designs. This visualization aids in comprehending and fine-tuning the linkage simulations.
    • Linkage analysis: it is possible to analyze linkage characteristics, such as input-output relationships, link lengths, and motion trajectories, enabling users to optimize their designs effectively.
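
    Here is a rough sketch of what creating and visualizing a small linkage looks like (names and arguments follow the pylinkage demo from memory and may differ slightly between versions):

    import pylinkage as pl

    # The motor: a crank rotating around the fixed point (0, 0)
    crank = pl.Crank(0, 1, joint0=(0, 0), angle=0.31, distance=1)
    # A pin joint closing the loop between the crank and a fixed anchor at (3, 0)
    pin = pl.Pivot(3, 2, joint0=crank, joint1=(3, 0), distance0=3, distance1=1)

    linkage = pl.Linkage(joints=(crank, pin), order=(crank, pin), name="Demo linkage")
    pl.show_linkage(linkage)  # animated matplotlib view of the motion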

    Getting Started

    To start exploring pylinkage, beginners can refer to the comprehensive documentation, which provides step-by-step guidance on installation, basic usage, and advanced examples. The documentation includes code snippets, visualizations, and explanations, ensuring an immersive learning experience.

    pylinkage documentation screenshot

    Showcasing pylinkage’s Potential

    To illustrate pylinkage’s capabilities, we present a real-life example of linkage design for a simple reciprocating engine mechanism. Leveraging the power of pylinkage, the user can define the linkage’s characteristics, simulate the motion, and analyze the resulting movement profiles, all within a Python environment.

    Kinematic view of TrotBot.

    Contributing to pylinkage’s Development

    Pylinkage is an open-source project that thrives on community contributions. Users are encouraged to actively participate in its development, sharing their ideas, reporting issues, and submitting pull requests. Collaboration strengthens the package’s potential and ensures its continual improvement and advancement.

    Conclusion

    With pylinkage, the world of mechanical linkage design becomes more accessible, dynamic, and enjoyable. By enabling users to construct and manipulate linkages in pure Python, pylinkage empowers engineers, designers, and hobbyists to unleash their creativity and explore the realm of mechanical motion transformations. Embark on your journey with pylinkage today and revolutionize your approach to mechanical linkage design.

  • The Future of LWT-fork and LUTE

    The Future of LWT-fork and LUTE

    As announced in a GitHub post, I have contributed enough to the current version of LWT to consider that it was time to roll over to a new project, namely LUTE. In this blog post I’m detailing the reasons for this change and its implications.

    LWT logo
    LUTE main logo

    A Bit of History

    After one year of contributing to my fork of LWT, I felt that I had done most of the reasonable housekeeping and small feature insertions. It was quite fun, but the more I worked, the more I felt that I had to stop contributing to the app at some point.

    The original LWT stems from a great concept: learning by reading in your target language. However, the app had a chaotic development: it dates back to 2011, was made by a single person who is not a professional software developer, and never received contributions. The app expanded iteratively and was never intended to be production-ready. It has no test cases, no clean access to the database, etc.

    Those were features I struggled to bring in, but at some point I had to admit that there were too many things to do, and that a refactoring was no longer the solution. I started posting about a new app and discussing with the community, until a very motivated person came and saved the day: jzohrab.

    He is an experienced software developer who decided to rebuild LWT from scratch at a time when I had my hands full fixing current LWT bugs. His intervention enabled me to pack up things with LWT while he was working on his software, enabling a smooth transition. As of May 2023, I consider both programs worth using.

    LUTE + LWT-fork Era

    LWT-fork is not a goner though. LUTE is a nice piece of software, but it hasn’t included all the features of LWT yet. More importantly, LUTE is jz’s software, and while we often collaborate, our views can diverge on some points. The arrangement we came to was that while he continues working on LUTE, I can freely solve problems on LWT, try new features, and import things I consider important into LUTE. This way, the projects can benefit from each other, and users are free to use either of the two.

    As a personal ambition for LWT, I now try to keep backward compatibility at a maximum while experimenting with new features that can be ported to LUTE, as stated in a previous discussion. LWT is still a nice playground for me because it deals with the basics of full-stack engineering (good to get experience), while LUTE relies on frameworks. Both are very nice to contribute to, and I hope you will find them useful.

    If you liked any of these projects, please leave a star (LWT/LUTE stargazers) or fork them; that is our only reward as open-source developers!

  • leggedsnake: Retro-ingeneering Evolution

    leggedsnake: Retro-ingeneering Evolution

    leggedsnake is a cool project I developed in Python. Its purpose is to improve the design of linkages, specifically leg mechanisms. The idea is to create a linkage using pylinkage, and then to use a dynamic simulation (with gravity and forces) to select a better linkage through a genetic algorithm.

    A walking linkage
    Demo video of leggedsnake

    Inspirations/Other cool projects

    This project was strongly influenced by Theo Jansen’s strandbeesten. At some point, it became more important when I felt that the already existing leg mechanism optimizers, such as the one from Amanda Ghassaei (and please visit her impressive website!), could be improved. I also got some inspiration from

    I also like to think about how the evolutionary path in the project led to solutions radically different from Evolution by Keiwan.

  • Maintaining LWT

    Maintaining LWT

    Hi there!

    In this post I’ll delve into a project which has taken a big part of my time for the last few months, the maintenance of an open-source language learning app: LWT.

    Original LWT logo

    LWT, standing for Learning With Texts, is a wonderful tool to learn languages by reading texts in your web browser, but it was all but abandoned 10 years ago, and it has not reached a larger audience since it was a bit difficult to use. So, I decided to blow the dust off it, and started working on the GitHub community-maintained version. For those interested, the original project was posted on SourceForge.net.

    Basically, the target was to give the tool the audience it deserved. The main targets were:

    • Easy installation
    • Mobile compatibility
    • More secure

    Of course, many subquests came along the way, and I ended up doing much more than expected. In the end, it was a good introduction to full-stack development and project management, as many things had to be done in a very short time. Now, I think that the project is much easier to maintain, so that independent developers can figure out how to do their own stuff without having a hard time understanding the code.

    Let’s talk a bit about the reasons why I started maintaining LWT. After I discovered LingQ, I found the tool very good, but also very expensive, as many language learners are children. So, I decided to work on its open-source alter ego, LWT. Moreover, after spending time studying Japanese, I needed a robust tool to master the writing system, and reading books can turn out to be a pain if you cannot read words, let alone get the meaning.

    Another motivation was a personal challenge. I had mostly done front-end development, and this was a much more ambitious full-stack project. I had no previous experience in PHP, or more generally, in how a server works internally, so it was a good introduction to the domain. I could also try a new set of tools and utilities, which is why my first commits were a bit confused.

    Finally, I have been working on this project for almost a year now, and things are much better. I think I could tackle most of the urgent problems, and for the rest of the work, it will diverge from what LWT originally was and is much more related to where I want to go next. So stay tuned, as I may start on a successor for LWT from now on!

  • Launching the Display Cabinet!

    The purpose of this blog is to showcase some projects that may be of interest to someone. I started it because I enjoyed starting new stuff, but once something was done it would fall into oblivion. Moreover, I do not have a lot of options for advertising GitHub projects, for instance, so here is this blog to improve their visibility.

    So, I hope you will enjoy going across the random projects of the Display Cabinet, and may you find something useful!

    Important Projects

    Here is a brief recap of the projects I present, or the ones I would like to talk about:

    • LWT, standing for Learning With Texts. A web-based application to learn foreign languages based on reading.
    • LWT successor: not yet published, so can’t say much, see the GitHub post.
    • pylinkage: a linkage designer and simulator written in Python. View on PyPi and GitHub.
    • leggedsnake: A walking linkage (or “leg mechanism”) optimizer based on pylinkage. PyPi/GitHub.