Before the official release of iOS 10, there were no public APIs that could help programmers build a speech-to-text mobile app. That is why they had to rely on third-party solutions, which were either quite expensive or not as efficient as required.
In 2016, the situation changed. During the Worldwide Developers Conference, Apple introduced its official speech-to-text framework. Its API is designed for voice recognition and lets developers all over the world use this official tool to create their own mobile apps like Siri. It is worth noting that this is the same framework Apple uses in its own speech-to-text iPhone app. As we all know, Siri is a personal assistant that understands voice commands and can perform various tasks given by a user.
Those tasks include setting an alarm, making a phone call to a particular person on the contact list, sending a message, etc. Such functionality can be useful in any service-oriented application, which is why developers strive to implement this technology to get a wow effect. So how do you create an iOS speech-to-text app? In this article, we review two approaches: using Apple's official API and full-cycle independent development.
Where Voice Recognition Can Be Used and How It Can Help Your Business Grow
Voice recognition can be much more than just speech transcription for startup owners. Sales departments of various companies use this technology to make their work faster and more cost-efficient. During a conversation between a sales development representative (SDR) and a customer, contact information pronounced by the client is processed by the system, recognized, and saved in a customer relationship management (CRM) system.
We pronounce words much faster than we type them. Dictating the text of an email and sending it to a particular person by voice command frees up time for other business processes. It can be even more useful when conditions do not permit typing with one's fingers, e.g. in frosty weather.
Advanced speech recognition systems can automatically identify a person by voice using voice biometrics. This helps avoid requesting sensitive data that can easily be stolen, while faking a voice is a much harder task for fraudsters. It is an especially useful feature for call centers of legal and financial institutions, where a person must be identified before deciding whether to provide him or her with the requested information.
Voice recognition is a great technology that can keep confidential data safe, free up time for other tasks, and make it possible to create digital documents in tough conditions.
Development of a Voice Assistant Mobile App Using API by Apple
Voice assistant mobile app development is quite simple with the official API introduced by Apple. First, create a new project by selecting "Single View Application" in the menu. In the next step, fill in the name of your project, team, and organization, and select a few other options such as the programming language and target devices.
Now start designing the UI of your future app by adding a "UILabel", "UITextView", and "UIButton". Then create outlets for the last two in the "ViewController.swift" file. Import the Speech framework into this file and adopt the "SFSpeechRecognizerDelegate" protocol.
Recorded voice is sent to Apple's backend for processing, so recognition is performed on Apple's servers. Any talking app like Siri has to get the user's permission for voice processing and authorize the person for the subsequent recognition process. To request authorization, use the "getPermissionSpeechRecognizer()" method and assign "self" as the delegate in "viewDidLoad()".
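The authorization step might be sketched as follows. This is a minimal illustration, not a definitive implementation: the method name "getPermissionSpeechRecognizer()" follows the article, while the outlet name, locale, and the rest of the wiring are assumptions.

```swift
import Speech
import UIKit

class ViewController: UIViewController, SFSpeechRecognizerDelegate {
    // Assumed outlet created in the UI step above.
    @IBOutlet var microphoneButton: UIButton!
    private let speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))

    override func viewDidLoad() {
        super.viewDidLoad()
        // Keep the button disabled until the user grants permission.
        microphoneButton.isEnabled = false
        speechRecognizer?.delegate = self
        getPermissionSpeechRecognizer()
    }

    func getPermissionSpeechRecognizer() {
        SFSpeechRecognizer.requestAuthorization { authStatus in
            // The callback may arrive on a background queue; hop to the
            // main queue before touching the UI.
            OperationQueue.main.addOperation {
                self.microphoneButton.isEnabled = (authStatus == .authorized)
            }
        }
    }
}
```

Remember that the app also needs the microphone and speech recognition usage descriptions in its Info.plist, or the request will fail at runtime.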
Next, create a new function to start recording and handle speech recognition. Its algorithm is as follows:
- The first action checks whether a recognition task is already in progress; if one is running, it is canceled before a new one starts.
- The second action creates an audio recording session, setting its category to recording and its mode to measurement.
- The third action initiates the recognition request.
- The next action checks whether the device has an audio input for recording a voice and reports an error if not.
- This step checks that the recognition request was successfully created.
- The next step makes the recognition request report partial speech recognition results while recording is still in progress.
- Then comes an action with a completion handler, which is called whenever the recognition engine receives input, improves its results, or the user cancels or stops the process, and which returns the results.
- Here a Boolean variable tracking whether recognition is final is declared and initialized to false.
- The next step displays the transcription whenever the result has a value; if the result is final, the flag changes to true.
- The next action stops the audio input, the recognition request, and the recognition task, and re-enables the button that launches recording.
- To feed audio into the request, add a line with "let recordingFormat" and install a tap on the audio input node; recognition proceeds as audio arrives.
- The final lines prepare and start the audio engine so the device begins recording.
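The steps above can be sketched as one method on the view controller. Property names such as audioEngine, recognitionRequest, recognitionTask, and textView are illustrative assumptions, and error handling is trimmed for brevity:

```swift
import Speech
import AVFoundation

// Assumed to live inside the view controller, alongside these properties:
// speechRecognizer, recognitionRequest, recognitionTask, audioEngine,
// textView, and microphoneButton.
func startRecording() {
    // 1. Cancel any recognition task that is already running.
    recognitionTask?.cancel()
    recognitionTask = nil

    // 2. Configure an audio session for recording in measurement mode.
    let audioSession = AVAudioSession.sharedInstance()
    try? audioSession.setCategory(.record, mode: .measurement, options: [])
    try? audioSession.setActive(true, options: .notifyOthersOnDeactivation)

    // 3. Create the recognition request.
    recognitionRequest = SFSpeechAudioBufferRecognitionRequest()

    // 4. Grab the device's audio input.
    let inputNode = audioEngine.inputNode

    // 5. Check that the request was created.
    guard let recognitionRequest = recognitionRequest else { return }

    // 6. Report partial results while the user is still speaking.
    recognitionRequest.shouldReportPartialResults = true

    // 7. Start the task; the handler fires on new input, improved results,
    //    or when the user cancels or stops the process.
    recognitionTask = speechRecognizer?.recognitionTask(with: recognitionRequest) { result, error in
        // 8. Track whether the result is final; start with false.
        var isFinal = false

        // 9. Show the transcription if there is a result; flip the flag when final.
        if let result = result {
            self.textView.text = result.bestTranscription.formattedString
            isFinal = result.isFinal
        }

        // 10. On error or final result, stop audio, release the request and
        //     task, and re-enable the recording button.
        if error != nil || isFinal {
            self.audioEngine.stop()
            inputNode.removeTap(onBus: 0)
            self.recognitionRequest = nil
            self.recognitionTask = nil
            self.microphoneButton.isEnabled = true
        }
    }

    // 11. Feed microphone audio into the request via a tap on the input node.
    let recordingFormat = inputNode.outputFormat(forBus: 0)
    inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { buffer, _ in
        self.recognitionRequest?.append(buffer)
    }

    // 12. Start recording.
    audioEngine.prepare()
    try? audioEngine.start()
}
```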
Now create a new function that starts the recognition process, first checking whether the microphone button is enabled. Then add a check for whether the device (audio engine) is already in use. If so, the application has to turn the microphone off, disable the microphone button, and display a button title that starts recording.
Conversely, if the microphone is not yet receiving voice data, the application has to start recording and display a button title that stops recording.
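The toggle described above might look like this; the button titles and property names are illustrative assumptions:

```swift
import UIKit

// Illustrative action wired to the microphone button (assumes the audioEngine,
// recognitionRequest, and microphoneButton properties from the steps above).
@IBAction func microphoneTapped(_ sender: UIButton) {
    if audioEngine.isRunning {
        // Recording in progress: stop audio, finish the request,
        // and show the title that starts recording again.
        audioEngine.stop()
        recognitionRequest?.endAudio()
        microphoneButton.isEnabled = false
        microphoneButton.setTitle("Start Recording", for: .normal)
    } else {
        // Not recording yet: start listening and show the stop title.
        startRecording()
        microphoneButton.setTitle("Stop Recording", for: .normal)
    }
}
```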
This is a simple way to make your own Siri-like app using Apple's official framework. Now let's move on to full-cycle voice assistant development for the Android platform.
Full-Cycle Development of a Mobile Assistant Android App
Fast data processing is crucial for mobile assistant applications. It lets users feel as if they were speaking with a real person who provides useful information upon request. To make the app fast, the mobile device has to maintain a streaming connection with the main server. Technologies used in voice assistant apps:
- Voice Biometrics;
- Noise Reduction Engine;
- Speech to Text Engine (STT);
- Tagging Server;
- Text to Speech Engine (TTS).
How It Works
When a user makes a request with a specific button, the app streams the request to the Main Server. The Main Server passes the data to the STT Server, which transcribes the voice into text. This text returns to the Main Server and is then sent to the Tagging Server to determine what kind of information is requested. The Tagging Server assigns a tag, e.g. "local_news", to the request extracted from the received text and sends this tag back to the Main Server.
The Information Server receives the tag from the Main Server, generates the requested information, and sends it back. If authentication is needed, the Security Server may perform it at this stage, e.g. via voice biometrics. Finally, the Main Server receives the data, turns it into a text response, graphics, or speech using the TTS Server, and returns it to the mobile device.
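The tagging step can be illustrated with a toy keyword-based tagger. A real Tagging Server would use proper language models; the keyword lists and tag names below are made up for illustration:

```swift
// Toy keyword tagger: maps a transcribed request to the tag the Main Server
// routes on. Keyword lists and tag names are illustrative only.
func tag(for transcript: String) -> String {
    let text = transcript.lowercased()
    let rules: [(keywords: [String], tag: String)] = [
        (["news", "headlines"], "local_news"),
        (["weather", "forecast"], "weather"),
        (["alarm", "wake"], "set_alarm"),
    ]
    // Return the tag of the first rule whose keyword occurs in the text.
    for rule in rules where rule.keywords.contains(where: { text.contains($0) }) {
        return rule.tag
    }
    return "unknown"
}

print(tag(for: "What are today's local news headlines?"))  // local_news
print(tag(for: "Wake me up at seven"))                     // set_alarm
```

With the tag in hand, the Main Server only needs a lookup from tag to Information Server endpoint, which keeps the routing logic independent of the speech pipeline.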
To build an app like Siri on Android, you have to create a custom web view that can be scrolled by the user and that scrolls automatically when a new callout appears. The first step is to create a new class extending the ordinary Android web view. This class has to include a constructor and override the onDraw function. In addition, add two new functions to the class: one initializes it and the other adds a new callout.
The second function takes the parameters "isResponse" and "message". The first parameter defines whether the message is actually an app response, makes the callout look different accordingly, and scrolls the dialogue.
Reducing data size
Speech recognition requires significant computing power and data storage. Compressing the audio data increases transfer speed and, as a result, makes the whole app work faster.
Types of data compression:
- Lossless data compression;
- Lossy data compression.
Lossless compression reduces data size without degrading the original quality. It is useful where quality matters, e.g. for audio files such as music.
Lossy compression reduces data size at the cost of some original quality, while keeping the content recognizable. It shrinks data much more than lossless compression. It is useful where quality matters less than the meaning, e.g. for audio files such as voice recordings.
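The trade-off can be demonstrated with sample quantization, one simple lossy technique: coarser values repeat more often and therefore compress better, but the original amplitudes cannot be restored exactly. The code below is a toy illustration, not a production codec:

```swift
// Illustrative lossy step: snap integer audio samples to a coarser grid.
// The quantized signal has fewer distinct values (so it compresses better),
// but the exact original amplitudes are lost.
func quantize(_ samples: [Int], step: Int) -> [Int] {
    // Swift integer division truncates toward zero, so each sample is
    // rounded toward zero to the nearest multiple of `step`.
    samples.map { ($0 / step) * step }
}

let original = [1003, -512, 730, 64, -2999]
let lossy = quantize(original, step: 100)
print(lossy)  // [1000, -500, 700, 0, -2900]
```

For speech, such quality loss is usually acceptable because the words stay recognizable, which is exactly why voice assistants can afford aggressive lossy compression where a music service cannot.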
It is obvious that voice recognition technology has tremendous potential, and we believe that at this stage most companies have not yet realized how it can influence business development. In this article, we have learned how to make an app like Siri, and, hopefully, it will help improve the effectiveness of your business. If you have any questions or are just wondering how the Lunapps team can help with your app development, please contact us.