Browser-use installation and configuration guide

In the digital age, millions of websites generate massive amounts of content every day, so it is easy to feel overwhelmed by the sheer volume of information available at the click of a button. The overload becomes even more apparent when we try to sift through that information: doing the filtering manually is not only a daunting task, but also deeply inefficient.

To address this problem, specialised web data extraction tools have been around for some time. These include BeautifulSoup, which makes it easy to extract information from static sites by parsing their HTML, and others such as Selenium or Playwright, which let you automate browser interactions to handle dynamic sites.

However, the emergence of Artificial Intelligence and LLMs has revolutionised this entire paradigm. Thanks to Browser-use, we can now achieve things that until recently existed only in our imagination. This tool integrates LLMs to automate interaction with web pages, allowing us to make significant progress in obtaining, filtering, and analysing information from the web.

Now that we know the problem Browser-use aims to solve, let’s see how to use it and how it can transform the way we extract and analyse data.

What is Browser-use?

Browser-use is a web scraping tool that allows you to automate the web browsing process in a natural way. By integrating language models (LLMs), it follows a procedure very similar to what a human would do.

In this sense, Browser-use completely changes the way we approach web scraping. Previously, this process was carried out either through direct interaction with the DOM or through CSS selectors that we had to know beforehand, so that we could sequentially execute the commands we wanted, mimicking what a human would do.

This caused many headaches, and not only because it required manually researching the structure of the website in question and writing ad-hoc code that interacted with specific elements. In addition, most websites are modified from time to time: changes to element names or a restructured DOM cause our code, which is sensitive to these small modifications, to stop working and fail at the tasks we assign to it.
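To make the contrast concrete, here is a toy sketch of that traditional, selector-driven approach using Playwright. Every URL and selector below is a made-up assumption about the page's structure; any site redesign silently breaks a script like this:

from playwright.sync_api import sync_playwright

# Illustrative only: hypothetical page and selectors
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com/login')
    page.fill('#username', 'user')       # breaks if the id changes
    page.fill('#password', 'secret')
    page.click('button.login-submit')    # breaks if the class changes
    browser.close()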

However, Browser-use works differently. The LLM interprets the content of the page and decides which steps are appropriate to achieve the goal set out in the prompt. This is remarkable because Browser-use not only automates interaction with the web: it also understands the context and purpose of each action and adapts its behaviour without the need for human intervention.

After analysing this type of technology from a theoretical point of view, we will see how to install and configure it for use in our projects.

Browser-use: Installation, configuration and example

Next, we will look at a practical example of how to automate the process of downloading a dataset from a public Kaggle page.

First, we will install and configure Browser-use as well as all the necessary dependencies. Then, we will look at the code we need to create to pass a prompt to Browser-use so that it searches for and downloads the specific table.

Installation and basic configuration

After analysing what Browser-use is and reviewing the project we will carry out, we will proceed with its installation and configuration. Throughout the article, we will keep the explanation as practical as possible and focus on the code needed to download a dataset from Kaggle.

Installing Browser-use with Gemini

To keep things simple, we will show you how to install Browser-use using the Gemini API. This API is free and the key can easily be obtained from the official Google website (Google AI Studio); to get it, you just need to register with an email address.

It is also possible to implement this with Ollama, which would give us even more independence: we would no longer depend on third-party services and could run everything locally. However, to avoid adding complexity, we will cover that case in a more advanced article.

The first step, as always, is to install the necessary dependencies for it to work properly. In this case, we will need browser-use and playwright. In addition, since we will be using the Gemini API, we also need langchain-google-genai.

To install them, you can run everything with a single command:

pip install browser-use playwright langchain-google-genai

On the other hand, since Browser-use relies on Playwright, we will also need Chromium. We can install it with the following command:

playwright install chromium

With this, we now have all the necessary dependencies to run Browser-use using Gemini as the LLM engine.

First steps to configure Browser-use

The next step will be to configure it correctly so that we can run it without any problems. The first thing we need to do is set the API key in our environment file (.env). This key is generated once we have registered with Google AI Studio.

[Screenshot: generating an API key in Google AI Studio]

If we are going to use a database, such as Postgres, to store the information we obtain, we must specify the necessary credentials for the connection.
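For illustration, such a connection block might look like this in the .env file (all variable names here are hypothetical):

# Database (only if you store results in Postgres)
DB_HOST=localhost
DB_PORT=5432
DB_USER=<your_user>
DB_PASSWORD=<your_password>
DB_NAME=<your_database>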

In this case, we will only download a file from a website, so it will not be necessary to set up a database. With this in mind, the .env file should contain the following:

# API
GOOGLE_API_KEY=<Your_api_key>

Now we have everything we need to start developing the code.

How to extract data with Browser-use

After installing the dependencies and configuring the environment, it’s time to start writing the code.

Configure the browser

The first thing we need to do is set up the browser session. We do this using the BrowserSession class, which manages the automated browsing environment. 

There are different settings that can be customised in this section. Some of them are, for example:

  • Execution mode (headless). With this parameter, we can specify whether we want the browser to run in the background (True) or be visible (False).
  • Browser window size (viewport). Very useful for simulating different devices.
  • Data storage directory (user_data_dir). The specified path will be where cookies, sessions, and other persistent data necessary during execution are stored.

A minimal session with these options looks like this:

from browser_use import BrowserSession

browser_session = BrowserSession(
    headless=True,
    viewport={'width': 964, 'height': 647},
    user_data_dir='~/.config/browseruse/profiles/default',
)
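During development, it can be helpful to set headless=False so you can watch the agent click through the site in a visible browser window, then switch back to True for unattended runs.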

Identify sensitive data

Once we have configured the browser, the next step is to define the task we want to perform. To do this, we first need to identify the data we are going to use that is sensitive (such as passwords and usernames). For security and privacy reasons, it is not recommended to send this data directly to the language model (LLM).

Browser-use provides a secure way to handle this situation: it offers a parameter called sensitive_data, a dictionary in which we list all the values we want to protect.

In this case, for security reasons, we do not want the model to see our email, password, and dataset. Therefore, we generate the following code:

# These variables are loaded from the .env file (see the final script below)
sensitive_data = {
    'kaggle_email': kaggle_email,
    'kaggle_password': kaggle_password,
    'dataset': dataset_name,
}

These values are referenced within the prompt by their keys (kaggle_email, kaggle_password, and dataset), without exposing the underlying sensitive information to the language model. Of course, the actual values must be present in the .env file so that they can be loaded at run time.
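Building on the .env file from earlier, this means adding entries like the following (the variable names match those read with os.getenv in the final script):

# Kaggle credentials (kept out of the prompt via sensitive_data)
KAGGLE_EMAIL=<your_email>
KAGGLE_PASSWORD=<your_password>
KAGGLE_DATASET=<dataset_name>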

Set the prompt

Once we have configured the browser and set the sensitive data, the next step is to define the task we want to execute.

To do this, we will write the prompt in a variable that we will call task_description. In addition, we will use the syntax shown in the previous section to reference the sensitive data that we do not want to pass to the model.

In our specific example, we will define the variable as follows:

task_description = (
    "Navigate to https://www.kaggle.com/, sign in with kaggle_email and kaggle_password. "
    "Search the dataset named dataset in Kaggle and download it as a zip file. "
    "Finally, log out of Kaggle and wait 5 seconds before continuing."
)

Initialise the model

Since we will be using Gemini as our LLM, we must configure it as follows. First, we load the variables we defined earlier in the .env file, as that is where we set our API key.

Next, we initialise the model with the following code:

from langchain_google_genai import ChatGoogleGenerativeAI
from dotenv import load_dotenv

# Load the variables from the .env file, including GOOGLE_API_KEY
load_dotenv()

# ChatGoogleGenerativeAI picks up GOOGLE_API_KEY from the environment
gemini_llm = ChatGoogleGenerativeAI(model='gemini-2.0-flash-exp')

Configure the agent

Finally, after configuring the browser, setting up confidential data, generating the prompt, and initialising the LLM model, you need to create the agent. Before doing so, it is important to know what it is and what its function is.

The agent is what actually interacts with the web pages. In other words, it uses the LLM models we have provided to make decisions in real time. These decisions will be based on the content it finds to achieve the goal specified in the prompt. 

Now that we know what an agent is, let’s define it in our code:

from browser_use import Agent

agent = Agent(
    task=task_description,
    llm=gemini_llm,
    max_actions_per_step=8,
    use_vision=True,
    browser=browser_session,
    generate_gif=False,
    sensitive_data=sensitive_data,
)

As you can see, in addition to the task to be performed, the LLM used, the browser session, and the sensitive_data we need, there are other interesting parameters that can make the agent more robust:

  • max_actions_per_step. Sets the limit of actions for each step of the agent’s execution.
  • use_vision. Allows the agent to interpret images. This is particularly useful on modern websites.
  • generate_gif. Option to generate a GIF recording of the process. This feature is useful in some cases, but in our specific example it does not add value.

Run the code

With all the configuration ready, the last thing left to do is run the task. To do this, we simply call the agent’s run() method at the end of our script. Since run() is a coroutine, it must be awaited inside an async function:

await agent.run()

Final structure

Finally, all that remains is to wrap all the code in a main() function and add an entry point so that it runs correctly when we call the script. Since Agent.run() is a coroutine, we define main() as async and execute it with asyncio.run(). The final code would look like this:

from browser_use import BrowserSession, Agent
from langchain_google_genai import ChatGoogleGenerativeAI
from dotenv import load_dotenv
import asyncio
import os

async def main():
    load_dotenv()

    kaggle_email = os.getenv("KAGGLE_EMAIL")
    kaggle_password = os.getenv("KAGGLE_PASSWORD")
    dataset_name = os.getenv("KAGGLE_DATASET")

    browser_session = BrowserSession(
        headless=True,
        viewport={'width': 964, 'height': 647},
        user_data_dir='~/.config/browseruse/profiles/default',
    )

    sensitive_data = {
        'kaggle_email': kaggle_email,
        'kaggle_password': kaggle_password,
        'dataset': dataset_name,
    }

    gemini_llm = ChatGoogleGenerativeAI(model='gemini-2.0-flash-exp')

    task_description = (
        "Navigate to https://www.kaggle.com/, sign in with kaggle_email and kaggle_password. "
        "Search the dataset named dataset in Kaggle and download it as a zip file. "
        "Finally, log out of Kaggle and wait 5 seconds before continuing."
    )

    agent = Agent(
        task=task_description,
        llm=gemini_llm,
        max_actions_per_step=8,
        use_vision=True,
        browser=browser_session,
        generate_gif=False,
        sensitive_data=sensitive_data,
    )

    await agent.run()

if __name__ == "__main__":
    asyncio.run(main())

Now, we can run the code simply by launching the script from our terminal.
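Assuming we saved the script as, say, kaggle_agent.py (the file name is just an example), this means running:

python kaggle_agent.py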

Conclusion

As we have seen, Browser-use is a tool that takes web scraping to the next level. It can quickly and easily integrate language models (LLMs) to interact with web pages, bringing flexibility and dynamism to the entire process.

In addition, this new tool opens up a wide range of possibilities in terms of task automation. With just a few lines of code, we can harness the full potential of these models, achieving a robustness that is very difficult to obtain with other tools.

In short, Browser-use has revolutionised the way web scraping is done. This tool promises to be a very powerful alternative for automating processes that require this kind of integration.

So much for today’s post. If you found it interesting, we encourage you to visit the Software category to see similar articles and to share it with your contacts on social media. See you soon!

Luís Galdeano