Who knows their way around Selenium or happens to have a suitable script?

**unravelmutton** · 29.08.2024, 02:21

Quote from BlueC:

Morning,
So let's assume I want to read certain metadata from the source code of my website at regular intervals using Selenium...
To do this, a ready-made PHP script (+cron job?) should be started on my local system that can take on this job.
There is a list of sub-URLs as a text file that are called up one after the other and processed by this script.
The whole thing should work in such a way that one URL at a time from this list is processed,
the result is temporarily stored in a text file and Selenium only continues to work after the parsing is finished.

Or to put it graphically:

Selenium runs the browser > URL list is loaded > URL1 > PHP script in path X is executed for this URL > source code is read out > echo in .txt file > continue with URL2...
URL list is loaded > URL2 > PHP script in path X is executed for this URL > source code is read out > echo in .txt file > continue with URL3...
URL list is loaded > URL3 > PHP script in path X is executed for this URL > source code is read out > echo in .txt file > continue with URL4...
etc. etc.
The whole thing should be carried out at regular intervals, e.g. every 24 hours. The number of URLs that need to be processed is relatively high.

(If there is another, simpler solution for this that would be fine too, but I've heard that Selenium is the best and fastest alternative for such browser-based operations?)

Any tips or ideas on how I can best set something like this up?


Install Selenium and WebDriver: Install the Selenium bindings for your preferred language (e.g. Python) and download the WebDriver that matches your browser (e.g. ChromeDriver for Google Chrome).
Create a URL list: A plain text file with one URL per line is enough.
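
As a rough illustration of the URL list step, here is a minimal Python sketch for reading such a file. The helper name `load_urls` is hypothetical, and skipping blank lines and lines starting with `#` is my own assumption, not something the original question asks for:

```python
from pathlib import Path


def load_urls(path):
    """Return the URLs from a text file with one URL per line.

    Blank lines and lines starting with '#' are skipped (an assumed convention).
    """
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls
```

Calling `load_urls("urls.txt")` would then give a clean list to loop over, without trailing newlines or empty entries.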

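The PHP script reads the source code of the web page and extracts the relevant metadata. It could look something like this:

    <?php
    function extractMetadata(string $url): array {
        // Load the page source; file_get_contents() returns false on failure
        $html = @file_get_contents($url);
        if ($html === false || $html === '') {
            return [];
        }

        // Parse the HTML without emitting warnings for malformed markup
        $doc = new DOMDocument();
        libxml_use_internal_errors(true);
        $doc->loadHTML($html);
        libxml_clear_errors();

        // Collect every <meta> tag that has a name attribute
        $data = [];
        foreach ($doc->getElementsByTagName('meta') as $meta) {
            $name = $meta->getAttribute('name');
            if ($name !== '') {
                $data[$name] = $meta->getAttribute('content');
            }
        }

        return $data;
    }

    // The URL is passed as the first command-line argument
    if ($argc < 2) {
        fwrite(STDERR, "Usage: php script.php <url>\n");
        exit(1);
    }
    $url = $argv[1];
    $metadata = extractMetadata($url);

    // Append the results to a file, labelled with the URL they belong to
    file_put_contents('metadata.txt', $url . "\n" . print_r($metadata, true), FILE_APPEND);
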
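The Selenium script can be written in Python. It controls the browser, opens each URL and runs the PHP script for it:

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    import subprocess
    import time

    # WebDriver setup (Selenium 4 takes the driver path via a Service object)
    driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))

    try:
        # Load the URL list, skipping blank lines
        with open('urls.txt', 'r') as file:
            urls = [line.strip() for line in file if line.strip()]

        for url in urls:
            driver.get(url)

            # Run the PHP script and pass the URL as a parameter.
            # subprocess.call blocks until the script has finished.
            # Note: the PHP script fetches the page again on its own.
            subprocess.call(['php', '/path/to/script.php', url])

            # Optional pause between requests
            time.sleep(2)
    finally:
        # Close the browser even if an error occurred
        driver.quit()
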
Set up a cron job that runs the Selenium script periodically. Here is an example that runs the script once a day at midnight:

    0 0 * * * /usr/bin/python3 /path/to/selenium_script.py
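
One thing to watch with cron: jobs normally run without a display and with a different working directory than your shell, so the browser generally has to run headless and a relative path like `urls.txt` may not resolve. A minimal sketch of how that could be handled, assuming Selenium 4 for Python and Chrome; `beside_script` and `build_headless_driver` are hypothetical helper names:

```python
from pathlib import Path


def beside_script(name, script_file):
    """Return an absolute path to `name` in the same directory as `script_file`."""
    return Path(script_file).resolve().parent / name


def build_headless_driver():
    """Create a Chrome driver that needs no display (assumes Selenium 4 and Chrome)."""
    # Imported here so the path helper also works where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    # Older Chrome versions may need plain "--headless" instead.
    options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)
```

In the Selenium script you would then open `beside_script("urls.txt", __file__)` instead of `'urls.txt'` and create the driver with `build_headless_driver()`.
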
Alternative solution without Selenium

If you are looking for a simpler and possibly more efficient solution, consider processing the URLs directly in PHP or another server-side script without going through Selenium. The PHP script could load the URLs from the list itself, extract the metadata and store it. This would probably be faster and more resource-efficient, as no browser rendering is required. It only works, however, if the metadata is already present in the HTML the server delivers and is not injected by JavaScript.
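
To give an idea of what the browser-free variant could look like, here is a minimal sketch in Python (to stay in the same language as the Selenium script; the same idea works in PHP). It uses only the standard library. `MetaParser`, `extract_metadata` and `fetch_metadata` are hypothetical names, and like the PHP script above it only collects `<meta>` tags that have a `name` attribute:

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class MetaParser(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            self.meta[name] = attributes.get("content") or ""


def extract_metadata(html):
    """Return a dict of meta name -> content for the given HTML string."""
    parser = MetaParser()
    parser.feed(html)
    return parser.meta


def fetch_metadata(url, timeout=10):
    """Download `url` and extract its metadata without starting a browser."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return extract_metadata(html)
```

Looping over the URL list and calling `fetch_metadata(url)` for each entry would replace both the browser and the separate PHP call.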

**hlightimpala** · 10.09.2024, 11:50

Hey, that sounds like a pretty cool project you’re working on! Here’s a straightforward approach to handle it. You can create a Python script that uses Selenium to process your URLs. The script reads the list of URLs, goes through them one by one, extracts the necessary metadata, and saves the results to a text file. You can then set up a cron job to run this script every 24 hours.

Here’s a rough outline of what you could do: Write a Python script to manage the Selenium browser tasks and the metadata extraction. Use a wait (a simple time.sleep, or an explicit wait) so that the page has loaded completely before grabbing the data. After extracting the data, append it to a results file. Finally, set up a cron job to run your Python script daily.
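
The outline above could be sketched roughly like this. It assumes Selenium 4 for Python and swaps the fixed time.sleep for an explicit wait on `document.readyState`. The function names and the one-JSON-line-per-URL output format are my own assumptions, not part of the original suggestion:

```python
import json


def format_record(url, metadata):
    """Serialize one result as a single JSON line so the results file is easy to parse."""
    return json.dumps({"url": url, "meta": metadata}, ensure_ascii=False, sort_keys=True)


def read_metadata(driver, url, timeout=10):
    """Open `url` in the browser and return its <meta name=...> tags as a dict."""
    # Imported here so format_record can be used without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    # Explicit wait: continue as soon as the document reports it is fully loaded.
    WebDriverWait(driver, timeout).until(
        lambda d: d.execute_script("return document.readyState") == "complete"
    )
    metadata = {}
    for tag in driver.find_elements(By.CSS_SELECTOR, "meta[name]"):
        metadata[tag.get_attribute("name")] = tag.get_attribute("content") or ""
    return metadata


def process(driver, urls, results_path="metadata.txt"):
    """Process the URLs one at a time and append one line per URL to the results file."""
    with open(results_path, "a", encoding="utf-8") as results:
        for url in urls:
            results.write(format_record(url, read_metadata(driver, url)) + "\n")
```

Because each URL is finished and written before the next `driver.get`, the browser only moves on once the previous page has been parsed, which is what the original question asks for.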

If you still want to use the PHP script in some way, you can have the Python script call it after saving the data, but for this task Python with Selenium should be efficient enough. It’s flexible and doesn’t need much tweaking to handle a lot of URLs.
