Example Script for Collecting YouTube Comment Data
In this guide, you will learn how to use an AgentQL script to navigate to a YouTube video and collect comment data.
Prerequisites
Instructions
If you'd like to start with the full example script, you can find it at the end of the tutorial.
Step 0: Create a New Python Script
In your project folder, create a new Python script and name it example_script.py
.
Step 1: Import Required Libraries
Import needed functions and classes from playwright
and agentql
libraries and import the logging
library.
import logging
import agentql
from playwright.sync_api import sync_playwright
logging
provides logging, debugging, and information messages, playwright
provides core browser interaction functionality and agentql
adds the main AgentQL functionality.
Step 2: Launch the Browser and Open the Website
logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger(__name__)
URL = "https://www.youtube.com/"
with sync_playwright() as playwright, playwright.chromium.launch(headless=False) as browser:
page = agentql.wrap(browser.new_page())
page.goto(URL)
- Set up logging:
logging.basicConfig(level=logging.DEBUG)
configures the logging to show debug-level messages. - Define
URL
: The URL"https://www.youtube.com/"
is the target website for the script. - Start Playwright instance with
sync_playwright()
. - Launch the browser:
playwright.chromium.launch(headless=False)
- Create a new page in the browser and wrap it to get access to the AgentQL's querying API:
agentql.wrap(browser.new_page())
. - Navigate to the website with
page.goto(URL)
.
Step 3: Define AgentQL Queries
SEARCH_QUERY = """
{
search_input
search_btn
}
"""
VIDEO_QUERY = """
{
videos[] {
video_link
video_title
channel_name
}
}
"""
VIDEO_CONTROL_QUERY = """
{
expand_description_btn
}
"""
DESCRIPTION_QUERY = """
{
description_text
}
"""
COMMENT_QUERY = """
{
comments[] {
channel_name
comment_text
}
}
"""
These queries provide the tool for communication and interaction with the right elements on the website. Ensuring you have functional and reliable queries is paramount!
Use the AgentQL Chrome Extension in parallel with the AgentQL SDK to test different query formats and keywords directly with the webpage!
Step 4: Try & Except Block
try:
# query logic
except Exception as e:
log.error(f"Found Error: {e}")
Catching Errors: Since an AgentQL query will either return an AQLResponseProxy
or throw an error during the querying process, best practice suggests to encapsulate query logic with a try and catch block for better debugging.
Step 5: Execute Search Query and Interact with Search Elements
response = page.query_elements(SEARCH_QUERY)
response.search_input.type("machine learning", delay=75)
response.search_btn.click()
- Search Query: Here we pass the
SEARCH_QUERY
to query specific elements on the page to interact with the search elements on YouTube page. - Type and Click: It types
"machine learning"
into the search input with a delay of 75ms between keystrokes and then clickssearch_btn
More information on the type()
and other available interaction APIs you can find in the official Playwright documentation.
Use the AgentQL to_data()
API when you want to convert the AgentQL Response into dict
of data and work with it rather than treating it as web elements!
Optional Step: Convert AgentQL response to a Python dict
Since the raw response is an AgentQL response object, it is not easy to work with it. You can use the to_data()
API to convert the response to a Python dict
.
SAMPLE_DATA_QUERY = """
{
videos[] {
title
date_posted
views
}
}
"""
response = page.query_elements(SAMPLE_DATA_QUERY)
log.debug(response.to_data())
The to_data()
API converts the AgentQL response to a structured dict in which it replaces the response nodes with text contents of the nodes.
Sample result would be as follows:
{
"videos": [
{
"title": "This is a nice video!",
"date_posted": "1 month ago",
"views": "2.7K Views"
},
{
"title": "This maybe a better video!",
"date_posted": "3 months ago",
"views": "2.7 Million Views"
},
{
"title": "This is best video!",
"date_posted": "1 year ago",
"views": "37.5K Views"
}
]
}
Step 6: Execute Video Query and Interact with Video Elements
response = page.query_elements(VIDEO_QUERY)
log.debug(
f"Clicking Youtube Video: {response.videos[0].video_title.text_content()}"
)
response.videos[0].video_link.click() # click the first youtube video
- Video Query: The script runs the
VIDEO_QUERY
to interact with video elements on the search results page. - Click on Video: It clicks on the first video link in the results.
- Logging: A debug message logs the title of the clicked video.
Step 7: Control Video Playback and Show Description
response = page.query_elements(VIDEO_CONTROL_QUERY)
response.expand_description_btn.click()
- Video Control Query: Run the
VIDEO_CONTROL_QUERY
to interact with video controls.
Step 8: Capture and Log Video Description
Instead of converting the intermediate AgentQL response to dict
via to_data()
API you can call page.query_data()
method to query for structured data at the first place.
response_data = page.query_data(DESCRIPTION_QUERY)
log.debug(
f"Captured the following description: \n{response_data['description_text']}"
)
- Description Query: Executes the
DESCRIPTION_QUERY
. - Logging Description: Logs the captured description of the video.
Step 9: Scroll Down the Page to Load Comments
To load comments on a YouTube video page, we need to scroll down the page a few times.
for _ in range(3):
page.keyboard.press("PageDown")
page.wait_for_page_ready_state()
- Press
PageDown
Button: Presses thePageDown
button to scroll down the page. - Wait for Page Ready State: Waits for the comments to load with AgentQL's
wait_for_page_ready_state()
method.
Step 10: Capture and Log Comments
response = page.query_elements(COMMENT_QUERY)
log.debug(f"Captured {len(response.comments)} comments!")
- Execute Query: Pass the session our
COMMENT_QUERY
to capture comment section data. - Count and Log: Here we simply log the number of comments captured.
Step 11: Stop the Browser
browser.close()
- Call
browser.close()
to releasing resources and close the browser.
Step 12: Run the Script
Open a terminal in your project's folder and run the script:
python3 example_script.py
Putting it all together…
Here is the complete script, if you'd like to copy it directly:
import logging
import agentql
from playwright.sync_api import sync_playwright
logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger(__name__)
URL = "https://www.youtube.com/"
with sync_playwright() as playwright, playwright.chromium.launch(headless=False) as browser:
page = agentql.wrap(browser.new_page())
page.goto(URL)
SEARCH_QUERY = """
{
search_input
search_btn
}
"""
VIDEO_QUERY = """
{
videos[] {
video_link
video_title
channel_name
}
}
"""
VIDEO_CONTROL_QUERY = """
{
play_or_pause_btn
expand_description_btn
}
"""
DESCRIPTION_QUERY = """
{
description_text
}
"""
COMMENT_QUERY = """
{
comments[] {
channel_name
comment_text
}
}
"""
try:
# search query
response = page.query_elements(SEARCH_QUERY)
response.search_input.type("machine learning", delay=75)
response.search_btn.click()
# video query
response = page.query_elements(VIDEO_QUERY)
log.debug(f"Clicking Youtube Video: {response.videos[0].video_title.text_content()}")
response.videos[0].video_link.click() # click the first youtube video
# video control query
response = page.query_elements(VIDEO_CONTROL_QUERY)
response.expand_description_btn.click()
# description query
response_data = page.query_data(DESCRIPTION_QUERY)
log.debug(f"Captured the following description: \n{response_data['description_text']}")
# Scroll down the page to load more comments
for _ in range(3):
page.keyboard.press("PageDown")
page.wait_for_page_ready_state()
# comment query
response = page.query_data(COMMENT_QUERY)
log.debug(f"Captured {len(response.comments)} comments!")
except Exception as e:
log.error(f"Found Error: {e}")
raise e
# Used only for demo purposes. It allows you to see the effect of the script.
page.wait_for_timeout(10000)