Suggestion - Have SSEP remove stale URLs from the database after re-indexing
Posted: 03 Dec 2019, 16:33
Hello, I run the ssep_cron script daily and it works great for re-indexing the registered pages in the database. It would be great if a function could be added to delete stale pages (ones that no longer exist on the website) from the database. I poked around in the crawlindex.php code for a bit, and I imagine this could be achieved by collecting into an array the URLs that were indexed during the script run, and then, at the end of the process, deleting from the database any URLs that are not in that array. Thanks!
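To illustrate the idea, here is a minimal sketch of that cleanup step. Everything here is hypothetical: findStaleUrls and the example URLs are not part of SSEP, and in a real patch the crawler would populate the "seen this run" array inside crawlIndex::run().

```php
<?php
// Hypothetical sketch of the proposed stale-URL cleanup.
// Neither the function name nor the example data comes from SSEP.

// Return the URLs that exist in the database but were not seen
// during the current crawl run, i.e. the stale ones to delete.
function findStaleUrls(array $urlsInDb, array $crawledThisRun): array
{
    return array_values(array_diff($urlsInDb, $crawledThisRun));
}

// Example: the crawl no longer reaches /old-page, so it is stale.
$urlsInDb       = ['https://example.com/', 'https://example.com/old-page'];
$crawledThisRun = ['https://example.com/'];
$stale = findStaleUrls($urlsInDb, $crawledThisRun);
// $stale could then drive a single parameterized DELETE query.
```

Running one DELETE over that list at the end of the cron run would keep the index in sync without having to clear it up front.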
Edit: I noticed there is a deletePages() function. If I call this function prior to re-indexing would that work? For example, this code in the cron script:
Code:
// Specify crawl settings as defined in php/crawlindex.php
$objci = new crawlIndex($obsql);
$objci->deletePages();
$objci->reindex = 1; //sets to re-index existing registered pages (0 to not re-index)
$objci->max_depth = 2; //depth to index
$objci->url_exclude = array('/index.php', '/search', 'webapps', '/branch/', '/account/', '/account_', '/departments/pc', 'changeStyle.php', '/news/memo'); //paths to exclude
$objci->deltags = [ ['a'=>[]], ['form'=>[]], ['select'=>[]], ['script'=>[]], ['link'=>[]], ['style'=>[]] ]; // array of tags to completely delete [ [tag=>[attr=>[values]]], ... ]
$_SESSION['ssep_dom_id'] = getDomainId($obsql, $objci->domain); //gets $_SESSION['ssep_dom_id'] from database
$start_url = 'https://' . $objci->domain;
$objci->run($start_url); //starts indexing