30 September 2024
Android has supported traditional machine learning models for years. Frameworks and SDKs like LiteRT (formerly known as TensorFlow Lite), ML Kit and MediaPipe enabled developers to easily implement tasks like image classification and object detection.
In recent years, generative AI (gen AI) and large language models (LLMs) have opened up new possibilities for language understanding and text generation. We have lowered the barriers for integrating gen AI features into your apps, and this blog post will provide you with the necessary high-level knowledge to get started.
Before we dive into the specifics of generative AI models, let’s take a high-level look: how is machine learning (ML) different from traditional programming?
A key difference between traditional programming and ML lies in how solutions are implemented.
In traditional programming, developers write explicit algorithms that take input and produce a desired output.
Machine learning takes a different approach: developers provide a large set of previously collected input data and the corresponding output, and the ML model is trained to learn how to map the input to the output.
Then, the model is deployed in the cloud or on-device to process input data. This step is called inference.
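To make the contrast concrete, here is a minimal Python sketch (not Android code). The temperature-conversion task, the function names and the tiny dataset are all illustrative assumptions. Real ML models are far larger and are trained with dedicated frameworks, but the three steps (write a rule, train from examples, run inference) are the same idea.

```python
# Traditional programming: the developer writes the rule explicitly.
def celsius_to_fahrenheit_rule(celsius):
    return celsius * 9 / 5 + 32


# Machine learning: the developer provides example inputs and outputs,
# and a training step estimates the mapping (here, a least-squares line fit).
def train_linear_model(inputs, outputs):
    n = len(inputs)
    mean_x = sum(inputs) / n
    mean_y = sum(outputs) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(inputs, outputs))
    variance = sum((x - mean_x) ** 2 for x in inputs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Inference: apply the trained model to an input it has not seen before.
def predict(model, x):
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training data: Celsius readings and their Fahrenheit labels.
training_inputs = [-40.0, 0.0, 10.0, 25.0, 100.0]
training_outputs = [-40.0, 32.0, 50.0, 77.0, 212.0]
model = train_linear_model(training_inputs, training_outputs)
```

Calling `predict(model, 37.0)` should land very close to `celsius_to_fahrenheit_rule(37.0)`, even though the conversion formula was never written into the model.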
This paradigm enables developers to tackle problems that were previously difficult or impossible to solve with rule-based programming.
Traditional ML on Android includes tasks such as image classification that can be implemented using MobileNet and LiteRT, or pose estimation that can be easily added to your Android app with the ML Kit SDK. These models are often trained on specific datasets and perform extremely well on well-defined, narrow tasks.
Generative AI introduces the capability to understand inputs such as text, images, audio and video and generate human-like responses. This enables applications like chatbots, language translation, text summarization, image captioning, image or code generation, creative writing assistance, and much more.
Most state-of-the-art generative AI models, like the Gemini models, are built on the transformer architecture. To generate images, diffusion models are often used.
At its core, an LLM is a neural network model trained on massive amounts of text data. It learns patterns, grammar, and semantic relationships between words and phrases, enabling it to predict and generate text that mimics human language.
As mentioned earlier, most recent LLMs use the transformer architecture. It breaks down input into tokens, assigns numerical representations called “embeddings” (see Key concepts below) to these tokens, and then processes these embeddings through multiple layers of the neural network to understand the context and meaning.
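The first two steps (tokenizing, then looking up embeddings) can be sketched in a few lines of Python. This is a toy illustration only: the vocabulary, the whitespace tokenizer and the three-dimensional vectors are made-up assumptions, whereas real models use subword tokenizers, vocabularies with many thousands of entries and much larger learned vectors.

```python
# Toy vocabulary mapping each token to an integer id (hypothetical values).
VOCAB = {"<unk>": 0, "android": 1, "runs": 2, "models": 3, "on": 4, "device": 5}

# Toy embedding table: one small vector per token id (made-up numbers).
EMBEDDINGS = [
    [0.0, 0.0, 0.0],  # <unk>
    [0.9, 0.1, 0.3],  # android
    [0.2, 0.8, 0.1],  # runs
    [0.7, 0.3, 0.6],  # models
    [0.1, 0.1, 0.1],  # on
    [0.8, 0.2, 0.4],  # device
]


def tokenize(text):
    """Split on whitespace; real tokenizers work on subword units."""
    return text.lower().split()


def to_ids(tokens):
    """Map tokens to ids, falling back to <unk> for unknown tokens."""
    return [VOCAB.get(token, VOCAB["<unk>"]) for token in tokens]


def embed(token_ids):
    """Look up the embedding vector for each token id."""
    return [EMBEDDINGS[token_id] for token_id in token_ids]


tokens = tokenize("Android runs models on device")
token_ids = to_ids(tokens)
vectors = embed(token_ids)
```

In an actual transformer, `vectors` would then flow through the network's layers, which is where context and meaning are computed.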
LLMs typically go through two main phases of training:
1. Pre-training phase: The model is exposed to vast amounts of text from different sources to learn general language patterns and knowledge.
2. Fine-tuning phase: The model is trained on specific tasks and datasets to refine its performance for particular applications.
Gen AI models come in various sizes, from smaller models like Gemini Nano or Gemma 2 2B, to massive models like Gemini 1.5 Pro that run on Google Cloud. The size of a model generally correlates with its capabilities and with the compute power required to run it.
Models are constantly evolving, with new research pushing the boundaries of their capabilities. These models are being evaluated on tasks like question answering, code generation, and creative writing, demonstrating impressive results.
In addition, some models are multimodal, which means that they are designed to process and understand information from multiple modalities, such as images, audio, and video, alongside text. This allows them to tackle a wider range of tasks, including image captioning, visual question answering, and audio transcription. Multiple Google Generative AI models such as Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini Nano with Multimodality and PaliGemma are multimodal.
The context window refers to the number of tokens (converted from text, image, audio or video) the model considers when generating a response. For chat use cases, it includes both the current input and a history of past interactions. For reference, 100 tokens is equal to about 60-80 English words, and Gemini 1.5 Pro currently supports 2M input tokens. That is enough to fit the seven Harry Potter books… and more!
Embeddings are multidimensional numerical representations of tokens that encode their semantic meaning and relationships within a given vector space. Words with similar meanings are closer together, while words with unrelated meanings are farther apart.
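As a rough sketch of what this means in practice, the Python below converts a token budget into an approximate English word count using the 60-80 words per 100 tokens rule of thumb, and trims a chat history so it fits a budget. The function names and the per-message token counts are hypothetical. A real app would get token counts from the model's own tokenizer.

```python
# Rule of thumb from the text: 100 tokens is about 60-80 English words.
WORDS_PER_100_TOKENS = (60, 80)


def estimate_word_range(token_count):
    """Return a (low, high) estimate of English words for a token count."""
    low, high = WORDS_PER_100_TOKENS
    return token_count * low // 100, token_count * high // 100


def trim_history(messages, token_counts, max_tokens):
    """Keep the most recent messages whose combined tokens fit the budget."""
    kept = []
    total = 0
    # Walk from newest to oldest so recent turns are kept first.
    for message, count in zip(reversed(messages), reversed(token_counts)):
        if total + count > max_tokens:
            break
        kept.append(message)
        total += count
    kept.reverse()  # restore chronological order
    return kept


low_words, high_words = estimate_word_range(2_000_000)
```

With a 2M token window, `estimate_word_range` gives roughly 1.2 to 1.6 million English words. Smaller models have much smaller windows, which is when a trimming strategy like `trim_history` starts to matter.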
The embedding process is a key component of an LLM. You can try it independently using MediaPipe Text Embedder for Android. It can be used to identify relations between words and sentences and implement a simplified semantic search directly on-device.
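Here is a minimal sketch of how such a semantic search could work once you have embedding vectors. It ranks stored sentences by cosine similarity to a query vector. The three-dimensional vectors below are invented for illustration. In practice they would come from an embedding model such as the one behind MediaPipe Text Embedder, whose API is not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query_vector, corpus):
    """Return (text, vector) entries sorted from most to least similar."""
    return sorted(
        corpus,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )


# Hypothetical sentences with made-up embedding vectors.
CORPUS = [
    ("How do I reset my password?", [0.9, 0.1, 0.0]),
    ("Best hiking trails nearby", [0.1, 0.9, 0.2]),
    ("Change account login details", [0.8, 0.2, 0.1]),
]

# Hypothetical embedding of the query "forgot my password".
query_vector = [0.85, 0.15, 0.05]
results = semantic_search(query_vector, CORPUS)
```

The two account-related sentences rank above the hiking one even though they share no words with the query, which is the point of searching by meaning rather than by keyword.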
Parameters like Top-K, Top-P and Temperature enable you to control the creativity of the model and the randomness of its output.
Top-K filters tokens for output. For example, a Top-K of 3 keeps the three most probable tokens. Increasing the Top-K value will increase the randomness of the model response (learn about Top-K parameter).
Then, defining the Top-P value adds another step of filtering. Tokens are selected from most to least probable until the sum of their probabilities reaches the Top-P value. Lower Top-P values result in less random responses, and higher values result in more random responses (learn about Top-P parameter).
Finally, the Temperature controls the degree of randomness when selecting among the remaining tokens. Lower temperatures are good for prompts that require a more deterministic and less open-ended or creative response, while higher temperatures can lead to more diverse or creative results (learn about Temperature).
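The three steps can be sketched in plain Python as follows. This is a simplified illustration that follows the order described above (Top-K, then Top-P, then Temperature), not the actual implementation inside any Gemini or Gemma runtime. The function names and the toy probability table are assumptions, and production samplers typically work on raw logits over the whole vocabulary.

```python
import math
import random


def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


def top_p_filter(probs, p):
    """Keep the most probable tokens until their cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = {}
    cumulative = 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return kept


def apply_temperature(probs, temperature):
    """Rescale probabilities as p^(1/T) and renormalize.

    Requires temperature > 0 and strictly positive probabilities.
    T < 1 sharpens the distribution; T > 1 flattens it.
    """
    scaled = {t: math.exp(math.log(pr) / temperature) for t, pr in probs.items()}
    total = sum(scaled.values())
    return {t: value / total for t, value in scaled.items()}


def sample_next_token(probs, k, p, temperature, rng=random):
    """Filter with Top-K, then Top-P, then sample using the temperature."""
    candidates = top_k_filter(probs, k)
    candidates = top_p_filter(candidates, p)
    weights = apply_temperature(candidates, temperature)
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


# Hypothetical next-token probabilities produced by a model.
next_token_probs = {"the": 0.5, "a": 0.3, "an": 0.15, "this": 0.05}
```

With `k=3` and `p=0.7`, only "the" and "a" survive the two filters. A temperature of 0.5 then makes "the" even more likely than its original share, while a temperature above 1 pushes the two candidates closer to an even split.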
Iterating over several versions of a prompt to achieve an optimal response from the model for your use case isn’t always enough. The next step is to fine-tune the model by re-training it with data specific to your use case. You will then obtain a model customized to your application.
More specifically, low-rank adaptation (LoRA) is a fine-tuning technique that makes LLM training much faster and more memory-efficient while maintaining the quality of the model outputs. The process to fine-tune open models via LoRA is well documented. See, for example, how you can fine-tune Gemini models through Google AI Studio without advanced ML expertise. You can also fine-tune Gemma models using the KerasNLP library.
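To give an intuition for where LoRA's savings come from, here is a small Python sketch of the core idea: the original weight matrix W (of size d × k) stays frozen, and training only updates two much smaller matrices B (d × r) and A (r × k), whose product is added to W. The layer size, the rank and the helper names below are illustrative assumptions, and this is not how you would fine-tune a model in practice (use a library such as KerasNLP for that).

```python
def full_finetune_params(d, k):
    """Trainable values when updating a full d x k weight matrix."""
    return d * k


def lora_params(d, k, r):
    """Trainable values with LoRA: B is d x r and A is r x k."""
    return r * (d + k)


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][n] * b[n][j] for n in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w, b, a, scale=1.0):
    """Combine the frozen weights W with the low-rank update: W + scale * (B @ A)."""
    delta = matmul(b, a)
    return [
        [w_val + scale * d_val for w_val, d_val in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical layer size and rank.
d, k, r = 2048, 2048, 8
full_count = full_finetune_params(d, k)
lora_count = lora_params(d, k, r)

# Tiny worked example with rank 1: a 2 x 2 frozen matrix plus a low-rank update.
frozen_w = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]
lora_a = [[0.5, 0.0]]
adapted_w = lora_effective_weight(frozen_w, lora_b, lora_a)
```

For this hypothetical 2048 × 2048 layer with rank 8, full fine-tuning would update 4,194,304 values, while LoRA trains only 32,768, which is 128 times fewer for that layer.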
With ongoing research and optimization of LLMs for mobile devices, we can expect even more innovative gen AI-enabled features coming to Android soon. In the meantime, check out other AI on Android Spotlight Week blog posts, and go to the Android AI documentation to learn more about how to power your apps with gen AI capabilities!