The Rules of How I Conducted My Massive AI Search Experiment

Robert Carnes
6 min read · Jul 24, 2024


I recently completed a broad experiment on how artificial intelligence impacts online search. I’ll post my observations and conclusions in another article, but I also wanted to transparently share my process for conducting the experiment.

This is a list of questions I asked, platforms I searched on, rules I followed, and results I collected. These are the dull background details that helped me learn about the current state of artificial intelligence and online search.

The Platforms

Search Engines

I tested five search engines: Google, Bing, DuckDuckGo, Yep, and Yahoo. Google is the dominant engine, but it’s not alone. Its recent algorithm leak has left some people skeptical and looking for other search options. Bing is similar to Google in many ways, while newer competitors like DuckDuckGo and Yep are more privacy-focused.

Artificial Intelligence

The AI tools I used included ChatGPT, Google Gemini, Anthropic’s Claude, Microsoft Copilot, and Perplexity.

*Requires an account to search. Copilot is limited to four searches without an account, but I got around that by accessing it in an incognito window and refreshing after every search.

I also started the experiment with Exa (formerly Metaphor), but it wasn’t very helpful, so I dropped it from the experiment.

Voice Assistants

I used three voice assistants: Siri, Alexa, and Google Assistant. I had intended to include Microsoft’s Cortana before realizing it’s been discontinued in favor of Copilot. Who says Google is the only one that unnecessarily sunsets everything?

Social Search

*Requires an account to search.

I used four social platforms: YouTube, TikTok, Quora, and Reddit. This list obviously excludes popular social channels like Facebook and Instagram, but those didn’t make sense as search engine replacements. I was also curious whether I could include platforms like Medium or Tumblr, but they didn’t fit my parameters.

The Questions

I broke the queries I fed to these platforms into six categories. Each was designed to test a different capability, and predictably, the tools handled them differently.

I intentionally wrote most of the questions to be vague or misleading (including an entire category designed to trip up the tools). The goal was to see how they reacted and whether they could make assumptions, think a step further, or draw human-like connections.

For example, asking about the capital of Georgia could mean either the US state (Atlanta) or the country (Tbilisi). Both answers are equally valid, and I received both; the better tools recognized the ambiguity and gave both answers.

Control

  • Where is the capital of Georgia?
  • Who wrote War and Peace?
  • What is the square root of 81?
  • When was the Battle of Hastings?
  • What is the weather outside?

Opinion

  • Who is better: Lebron or MJ?
  • Who will win the 2024 US presidential election?
  • What is a good nonprofit to donate to?
  • What is MLK most known for?

Actions

  • Tell me a joke that will make me laugh.
  • Summarize The Godfather.
  • Write an email to my boss about coming in sick.
  • Suggest a nice restaurant for dinner near me.
  • Translate lorem ipsum.

Business

  • What is GreenMellen?
  • What services does GreenMellen offer?
  • Who works at GreenMellen?
  • What are the best digital marketing agencies in Marietta, GA?

FYI: GreenMellen is the digital marketing agency I work for. I chose it specifically because it’s relatively obscure but well established, and because I know enough about the company to spot inaccurate information.

Meta

  • Which search engine gives the best results?
  • Will AI replace search engines?
  • How will AI impact digital marketing?
  • What query will break an AI tool?

Misleading

  • What are the five rules of Fight Club?
  • Who was the first female US president?
  • When did the Detroit Lions win the Super Bowl?
  • How did Paul McCartney die?

The Rules

  • Search in an incognito window to keep results objective and free of personalization. (This wasn’t possible for the few tools that require a logged-in account.)
  • Use the exact same wording and formatting on every platform.
  • Copy the exact response and any other relevant information.

The Results

In case you’re curious, I recorded every search and result into this spreadsheet. I tried to capture as much raw data as possible during the process without overwhelming my internal servers.

Here are some of the statistics on the scope of this experiment:

  • 26 questions
  • In 6 categories
  • 19 platforms
  • 5 search engines
  • 7 AI tools
  • 3 voice assistants
  • 4 social platforms
  • 520 total queries

Highest Auto-Response Rates

  • I counted the number of times a search engine automatically answered a question with a “featured snippet” pulled from a web page.
  • Google led with nearly 70% of responses featuring an automatic answer; Bing was close behind at 61% of searches.
  • Most others had far fewer: DuckDuckGo (30%) and Yahoo (11%). Yep, a newer search engine, never included a featured answer, which is part of its business model.
  • These auto-answers are helpful, but they also drain traffic away from websites and are intended to keep users on the search engine as long as possible. (A quick sketch of the rate math follows this list.)
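For the curious, the rate math is just the share of the 26 questions that triggered a featured snippet on each engine. Here’s a minimal Python sketch; the per-engine counts are my own back-calculations from the percentages above, not the raw spreadsheet data:

```python
# Featured-snippet ("auto-response") rates per search engine.
# Counts are illustrative, back-calculated from the percentages above.
TOTAL_QUESTIONS = 26

snippet_counts = {
    "Google": 18,      # ~70%
    "Bing": 16,        # ~61%
    "DuckDuckGo": 8,   # ~30%
    "Yahoo": 3,        # ~11%
    "Yep": 0,          # never shows featured answers, by design
}

for engine, count in snippet_counts.items():
    rate = count / TOTAL_QUESTIONS * 100
    print(f"{engine}: {count}/{TOTAL_QUESTIONS} = {rate:.0f}%")
```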

Readability of Chatbots

  • I counted the average number of words in each chatbot response. (Or rather, I had ChatGPT help me write a spreadsheet formula to do that automatically.)
  • Perplexity gave the longest answers, at about 192 words per query. ChatGPT was the shortest, at 97 words.
  • Claude and Microsoft’s Copilot had nearly identical response lengths, at 132.1 and 132.7 words, respectively.
  • I also calculated the average readability score, based on the Flesch reading-ease formula. The higher the number, the easier the text is to read. (A rough Python version of this calculation follows this list.)
  • All the bots scored similarly, but Microsoft’s Copilot had a narrowly higher readability score (47), and ChatGPT surprisingly had the lowest (37).
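The readability calculation isn’t magic, either. Here’s a rough Python equivalent of what my spreadsheet formulas did, using the standard Flesch reading-ease formula; the naive vowel-run syllable counter is my simplification, so scores will differ slightly from dedicated tools:

```python
import re

def count_syllables(word: str) -> int:
    # Naive estimate: each run of consecutive vowels counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    # Flesch reading ease: higher scores mean easier-to-read text.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = text.split()
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / sentences)
            - 84.6 * (syllables / len(words)))

answer = "Perplexity gave long answers. ChatGPT kept things short."
print(f"{len(answer.split())} words, "
      f"reading ease {flesch_reading_ease(answer):.0f}")
```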

Voice Assistant Speaking Time

  • Not surprisingly, AI voice assistants gave much shorter answers than the text tools.
  • Their average word counts fell roughly in the 20–50 range, with Siri averaging 25 words per answer and Google Assistant just over 50.
  • I estimated the length of each response based on an average speaking speed: Siri’s answers clocked in at about 9 seconds, Alexa’s at 10 seconds, and Google’s at around 18 seconds. (A sketch of this conversion follows the list.)
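That words-to-seconds conversion is simple division. A sketch, assuming an average speaking pace of about 165 words per minute (my assumption, which lines up with the 9- and 18-second figures above; Alexa’s word count is my back-of-envelope guess from its ~10-second time):

```python
# Estimate speaking time from average answer length,
# assuming ~165 words per minute, a typical conversational pace.
WORDS_PER_MINUTE = 165

avg_words = {
    "Siri": 25,              # from the averages above
    "Alexa": 28,             # assumed, inferred from its ~10-second answers
    "Google Assistant": 50,  # "over 50" per the averages above
}

for assistant, words in avg_words.items():
    seconds = words / WORDS_PER_MINUTE * 60
    print(f"{assistant}: ~{seconds:.0f} seconds")
```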

Social Media Statistics

  • Measuring the social media platforms was like comparing apples to walruses. These systems are vastly different and even have different stats to collect.
  • Therefore, I captured first-result video views on YouTube and TikTok, the number of responses to a question on Quora, and the upvotes a post got on Reddit.
  • On average, TikTok’s first video results saw over half a million views (551k), compared to 407k on YouTube.
  • The first result on Quora got an average of 36 responses. Reddit’s first results averaged about 3,400 upvotes.

Changes During the Experiment

This technology is moving quickly, to the point where some changes happened during the experiment itself. I pivoted as needed, but I also know that many more changes are coming, which may quickly make this experiment irrelevant. Only time will tell.

Limitations

  • I’m no scientist or statistician, so this experiment is far from perfect. I tried to make it as objective and rigorous as I knew how.
  • I tried to collect enough data points to keep from skewing the results, but I’m only one person with limited time, so I covered as much ground as I could.
  • This experiment was done from one location (metro Atlanta) in a single language (English), so it’s far from a comprehensive representation of the global or national impact of AI.
  • The rapid changes in this technology make any assessment of how it’s used short-lived. But I hope that this stands as a decent snapshot of this moment in digital search.

Using AI to Assist the Experiment

  • I asked both ChatGPT 3.5 and Google Gemini for help framing the questions used in the experiment. I didn’t end up using any of their suggestions directly, but they helped me think through the final list.
  • ChatGPT was helpful in writing some of the Excel formulas used to calculate the findings. I had to make some adjustments to get them working, but it saved me from having to write them entirely myself.
  • Once the data was collected, I uploaded it as a CSV file into ChatGPT and asked it to assess the results. It was great at understanding the parameters of the experiment, but it gave very bland and predictable assessments.
