AI Data Processing Block | Smart Automation Without Code

Table of Contents

In the ever-evolving landscape of AI, our team embarked on a journey to enhance our platform by incorporating a practical AI block feature. The possibilities seemed limitless, ranging from facial recognition for attendance monitoring to extracting vital information from images, such as analyzing the quality and detecting diseases in plants through image analysis. After successfully validating these use-cases through a Proof of Concept (POC), we were eager to integrate this AI capability into our platform.

Choosing the right AI API was a critical decision, and after careful consideration, we opted for the OpenAI ChatGPT API. Our exploration led us to this choice, as alternatives like Bard fell short in delivering satisfactory responses, and Gemini had not yet opened up its API to the public. OpenAI's GPT API demonstrated superior performance in handling diverse tasks. With a clear direction, we set our sights on leveraging the power of the OpenAI GPT API to bring our AI block feature to life.

Text Analysis Implementation

For text analysis, we've adopted the straightforward approach of employing the gpt-3.5-turbo model. Input for text analysis is facilitated through dedicated text blocks, allowing users to provide instructions for the subsequent evaluation. We've strategically implemented constraints, capping the maximum input length at 500 characters, and limiting the output to 100 characters for succinct and manageable results.

Image Analysis Implementation

For our image analysis feature, we've incorporated the powerful 'gpt-4-vision-preview' model, an extension of GPT-4 Turbo with advanced image comprehension capabilities. Currently, it references 'gpt-4-1106-vision-preview.' Users can seamlessly upload images using our existing file block, which is then directed to S3.

Here's an overview of our image analysis workflow:

File Upload and Storage: Users utilize the existing file block to upload images, which are then stored on S3, ensuring efficient and reliable file management.
‍Microservices Architecture: Leveraging microservices, the AiAnalysisService takes charge when the Ai block is triggered. This architecture allows for modular and scalable image analysis operations.‍
Cost Optimization Checks: Before processing, our system performs a series of cost optimization checks. These checks, which contribute to efficient resource utilization, will be discussed in detail later.‍
Image Processing: Images are retrieved from S3, converted to base64 format, and then processed by the 'gpt-4-1106-vision-preview' model. It's worth noting that our system adheres to GPT's 20 MB limit for image file sizes, allowing inputs of up to 10 MB in base64 format.‍
Supported File Types: Our image analysis feature supports a variety of file types, including jpeg, jpg, png, webp, and gif.

Optimizations for Multiple Images:

Downloading Unique Images: To enhance efficiency, our system downloads only unique images from the S3 bucket, minimizing redundancy and conserving resources.‍
Parallel Image Downloads: It concurrently downloads multiple images leveraging the benefits of concurrency to expedite the overall image analysis process.

Understanding Image Analysis Cost Calculation

The cost of image analysis is calculated based on tokens, factoring in the image size and the detail option. For images with detail: low, the cost is a fixed 85 tokens. For detail: high images, they are resized to fit within a 2048 x 2048 square, maintaining aspect ratio, and then scaled down to a minimum of 768px on the shortest side. The total token cost is calculated based on the number of 512px squares required, with each square costing 170 tokens, plus an additional 85 tokens. Examples illustrate the token cost for different image sizes and detail options, ensuring a transparent and scalable pricing structure.

Non-Empty Instruction Requirement

In the AI block, instructions are mandated to be non-empty, requiring users to provide either static or dependency-based guidance. This ensures purposeful utilization, preventing empty instruction scenarios. For tasks like extracting a restaurant bill amount with GST considerations, users articulate specific instructions such as, "Extract the total bill amount from the provided @restaurantFileBlock." This approach fosters a nuanced and effective use of the AI block, emphasizing clarity and user intentionality, while minimizing the potential for ambiguous or redundant queries.

Rate Limiting for Workplace

To manage usage effectively, we've implemented rate limiting for each workplace, currently set at 100 triggers. This cautious approach during the initial stages allows users to explore the feature while maintaining a balanced workload on our AI resources.

Lambda Authorizer Restrictions

To ensure controlled access to the AI feature, only individuals with authorized access to the app are eligible to utilize it. In the context of public sharing, if an app has public sharing enabled and a user lacks the necessary authorization, attempts to use the AI block will result in an error. This restriction applies specifically to users who do not possess the requisite authorization credentials.

Additionally, users are barred from utilizing the AI feature in the context of embedded and publicly shared apps. This stringent approach ensures that the AI capabilities remain within the intended user base and aligns with the access control policies established for the application.

Caching Strategy

Implementing an efficient caching mechanism has been a pivotal aspect of our system, particularly for instances where the same request with identical field IDs for a form instance is encountered. This strategy significantly optimizes our response time and resource utilization. We've designed the caching system to recognize when a request shares the same field ID for a form instance. This ensures that redundant queries are avoided, optimizing the overall processing flow. This caching also works with image queries.

Triggering only when all dependencies are fulfilled

Within the AI block, intricate linkages may exist, involving multiple lines of text or numerous images. To maintain a streamlined process, the AI block is designed to trigger only when all associated dependencies are filled. If any of the required blocks are left empty, the AI block prompts the user to complete all missing dependencies before proceeding. This strategy not only enhances the effectiveness of the AI feature but also contributes to a more streamlined and cost-conscious user experience.

Setting detail parameter of ai model

The detail parameter in the AI model offers three options: low, high, or auto, providing control over the image processing and textual generation. The default setting is auto, where the model dynamically chooses between low and high based on the image input size. In low mode, a 512px x 512px low-res version of the image is received, utilizing a budget of 85 tokens for faster responses in scenarios not demanding high detail. Conversely, high mode allows the model to examine the low-res image and generate detailed 512px squares, using a total token budget of 129 tokens for more intricate representations.

Output Optimization: Restricting to 100 Characters

To optimize cost efficiency, the output from the model has been capped at a maximum of 100 characters. This restriction aligns with the token-based charging system of GPT, allowing users to receive concise and relevant responses while managing and predicting token consumption effectively.

Future Scopes for AI Block Enhancement: Expanding to Offline Plugin and Diverse File Formats

Looking ahead, our AI block envisions a series of enhancements to broaden its utility. One key area of focus is extending support to an offline plugin, enabling users to harness AI capabilities even without an active internet connection. Additionally, we aspire to diversify the file formats handled by the AI block, including but not limited to PDFs, videos, and signature blocks. These expansions aim to provide users with a more comprehensive and versatile AI experience, catering to a wider array of data types and user scenarios.

‍

This blog post was originally published here.