The rapidly growing OpenAI platform has opened up vast possibilities for technical professionals. Its applications are virtually unlimited, and it integrates easily into a wide range of platforms and use cases.
With a background in browser extension development, I wanted to combine the convenience and functionality of extensions with the powerful capabilities of the OpenAI API. The result is Outmeet, a Chrome extension that captures and transcribes microphone and active-tab audio and responds based on custom user prompts. Here, I want to discuss the technical side of its implementation.
The extension is built with React and Tailwind, with Webpack bundling everything together. Its architecture can be divided into several logical components, which I will describe as separate parts of the system.
To obtain the full context of the conversation, audio must be captured not only from the user's microphone but also from the active tab. These are two distinct tasks, which I describe in detail below.
Since the move to Manifest V3 in Chrome extensions, the best way to obtain a stream from the user's microphone is a pinned tab: the extension creates a tab with an internal page where the audio stream can be obtained using WebRTC.
Pinning the tab keeps things tidy and significantly reduces the chance of the page being closed during an active session. Note that the user must grant permission to use audio devices, so this needs to be handled. The same approach is used by industry giants like Loom.
chrome.tabs.create({
  url: chrome.runtime.getURL('index.html'),
  pinned: true,
  active: false
});
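On that internal page, the stream itself comes from the standard getUserMedia API. A minimal sketch of the flow (the real extension uses a recording library, but plain MediaRecorder illustrates the idea; the 30-second timeslice is an illustrative choice, not Outmeet's setting):

```javascript
// Runs inside the pinned tab's internal page (index.html).
// The user must grant microphone permission the first time this runs.
let chunks = [];

async function startMicCapture() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream, { mimeType: 'audio/webm' });
  // Collect a chunk every 30 seconds so long sessions
  // can be transcribed incrementally.
  recorder.ondataavailable = (event) => chunks.push(event.data);
  recorder.start(30_000);
  return recorder;
}
```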
Tab capture in Chrome extensions has a notable limitation: it can only be triggered after the user interacts with the extension, such as by clicking the extension's action button. As a result, the extension gains access to the audio stream only on the page where the user initiated it.
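Under Manifest V3 this flow is typically split in two: the service worker turns the action click (the user gesture) into a stream ID via chrome.tabCapture.getMediaStreamId, and a page owned by the extension converts that ID into a real MediaStream. A sketch under those assumptions, not Outmeet's exact code:

```javascript
// background/service worker
function getTabStreamId(tabId) {
  // In the real flow, a consumerTabId option can restrict which tab
  // may consume the stream (e.g. the pinned tab).
  return new Promise((resolve) =>
    chrome.tabCapture.getMediaStreamId({ targetTabId: tabId }, resolve)
  );
}

function registerCaptureHandler() {
  // The action click is the user gesture that unlocks tab capture.
  chrome.action.onClicked.addListener(async (tab) => {
    const streamId = await getTabStreamId(tab.id);
    // Forward the ID to the pinned tab's page via runtime messaging.
    chrome.runtime.sendMessage({ streamId });
  });
}

// pinned tab page: turn the stream ID into an actual MediaStream
async function captureTabAudio(streamId) {
  return navigator.mediaDevices.getUserMedia({
    audio: {
      mandatory: {
        chromeMediaSource: 'tab',
        chromeMediaSourceId: streamId,
      },
    },
  });
}
```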
The service worker is the natural place in the extension to interact with the OpenAI Node API Library. Communication between all the other parts happens through the standard chrome.runtime messaging interface. Sending audio is tricky, because extension messages must be JSON-serializable, so a Blob cannot cross the boundary directly; this is resolved by sending it as a data URL:
// tab: serialize the recorded Blob as a data URL before messaging
recorder.getDataURL((dataURL) => {
  chrome.runtime.sendMessage({ dataURL });
});

// background: rebuild the Blob and send it to Whisper
// (openai is a client instance of the OpenAI Node API Library)
chrome.runtime.onMessage.addListener(({ dataURL }) => {
  fetch(dataURL)
    .then((res) => res.blob())
    .then(async (blob) => {
      const file = await OpenAI.toFile(blob, 'audio.webm', { type: 'audio/webm' });
      return openai.audio.transcriptions.create({
        model: 'whisper-1',
        file
      });
    });
});
The content script is an excellent way to communicate with the user and interact with any page. In the case of this extension, the communication happens through a widget rendered on the active tab. I don't think additional explanations are needed here, except for one challenge I had to overcome: how to use Tailwind on a page without breaking its styles. The solution is very simple: render the widget in a Shadow DOM.
// Assumes createRoot (react-dom/client) and the Widget component
// are imported elsewhere in the bundle.
document.onreadystatechange = function () {
  if (document.readyState === 'complete') {
    const host = document.createElement('div');
    host.setAttribute('id', 'outmeet');
    const shadowRoot = host.attachShadow({ mode: 'open' });
    const root = document.createElement('div');
    shadowRoot.appendChild(root);
    // Load Tailwind inside the shadow root so its styles
    // can't leak into (or be overridden by) the host page.
    const stylesheet = document.createElement('link');
    stylesheet.setAttribute('rel', 'stylesheet');
    stylesheet.setAttribute('href', chrome.runtime.getURL('tailwind.css'));
    stylesheet.onload = () => {
      createRoot(root).render(<Widget />);
    };
    shadowRoot.appendChild(stylesheet);
    document.body.after(host);
  }
};
Another strategically important part is displaying content on the web page side. It's essential to understand that all settings, API keys, and recorded sessions are stored exclusively in the extension's local storage (so removing the extension erases all data). Session content, however, is rendered onto a public page for display.
{
  "content_scripts": [
    {
      "js": ["js/widget.bundle.js"],
      "matches": ["<all_urls>"]
    },
    {
      "js": ["js/content.bundle.js"],
      "matches": ["https://outmeet.dev/board*"]
    }
  ]
}
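To illustrate how the content script on the board page can bridge the two worlds, here is a sketch; the 'sessions' storage key and the event name are hypothetical, not Outmeet's actual identifiers. The data never leaves the browser: it is read from extension storage and handed to the page on the fly.

```javascript
// content.bundle.js (sketch) — runs only on https://outmeet.dev/board*
async function loadSessions() {
  // Read recorded sessions from the extension's local storage.
  const { sessions = [] } = await chrome.storage.local.get('sessions');
  return sessions;
}

async function publishToBoard() {
  const sessions = await loadSessions();
  // Content scripts share the DOM with the page, so a CustomEvent is
  // enough to pass data across the isolated-world boundary.
  document.dispatchEvent(
    new CustomEvent('outmeet:sessions', { detail: JSON.stringify(sessions) })
  );
}
```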
This approach lets sessions be viewed as ordinary page visits without storing any data on the server side, and it also leaves room for Google Analytics.
I hope this brief article gives you a better understanding of how to wrap the OpenAI API in a Chrome extension interface and how the Outmeet extension works. Feel free to leave comments; I'll be happy to answer any questions.