SSEP - not following relative links correctly

mike406
Posts: 7

Hello, the SSEP scripts work nearly perfectly for my site, except that for some reason they don't follow relative paths correctly.

Say for example:
I add my domain

https://mydomain.com
Then I index it from the root.
Next it will follow into

https://mydomain.com/news/memo
and index the index.php page for this directory.
On this page, I have a relative link <a href="old_memos.php"> - this is where SSEP trips up. Rather than indexing

https://mydomain.com/news/memo/old_memos.php
it attempts to index

https://mydomain.com/news/old_memos.php
and throws a Status: 404 for it, because this path does not exist. SSEP seems to be moving up one directory whenever it hits a relative path. It does this for any page I have with relative links.

Here is a screenshot of what it looks like: imgur.com/GHCAFal

Admin Posts: 805
Hello,
I tested the SSEP script with a relative path in a link, and it works correctly; see the screenshot below.

1. In /zz/memo/ I added an index.php file with this link:

<a href="old_memo.php" title="vbnm">link test</a>
2. In the /zz/memo/ folder there is a file old_memo.php.

Maybe you have the relative link with "../" in front of the address: href="../old_memos.php"
- Post here the address of the page with the relative path so I can see and test it.

See screenshot:
Attachments
ssep_test_indexing.jpg

mike406 Posts: 7
It’s an internal website so I can’t really send a link as it wouldn’t be accessible. And I can confirm that there is no ../ in front of the link as I am the web programmer for my company and can view both the code and the directory structure of the server. I can also confirm that there are no 404 links on the site either. It happens hundreds of times in the logs SSEP gives, for every instance it finds a page with a relative link on it. So I know something is amiss. I could see there being a mistake on my end if it only happened for certain pages, but not all of them. Perhaps there is something in my Apache configuration that SSEP does not play nice with.

Admin Posts: 805
The crawler in the ssep script reads the html source of the page (with cURL, from the given url), extracts and follows the links.
I think the problem comes from the link on the previous page: a link to a folder without a trailing "/".

mydomain.com/news/memo
The crawler reads that link as a page in "mydomain.com/news/", not as a folder; it then keeps "mydomain.com/news/" as the base for the relative links on that page.
The link '/news/memo' might be a folder (category), but it could also be just a page in '/news/'. The crawler cannot tell, and since the URL has no trailing "/", it treats it as a page in /news/.
- If you add a trailing '/' to the links that point to a category (folder), so that 'mydomain.com/news/memo' becomes '/news/memo/', it will work.
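
The base-path behaviour described above can be reproduced outside SSEP. The following is only an illustrative sketch in Python (not SSEP's PHP code), using the standard library's `urljoin`, which resolves relative references in the usual RFC 3986 way. The variable names are made up for the example.

```python
from urllib.parse import urljoin

href = "old_memos.php"  # the relative link found on the page

# No trailing slash: "memo" is treated as a document inside /news/,
# so the last path segment is dropped before the link is appended.
without_slash = urljoin("https://mydomain.com/news/memo", href)

# Trailing slash: "memo/" is treated as a directory, so it is kept.
with_slash = urljoin("https://mydomain.com/news/memo/", href)

print(without_slash)
print(with_slash)
```

The first result points into /news/ and the second into /news/memo/, which matches the 404 reported in the first post.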

mike406 Posts: 7
URL normalization is the responsibility of the web spider. No RFC says that trailing slashes must be included, and the HTML spec accepts URLs with or without them. Furthermore, my Apache server answers any path without a trailing slash with a 301 redirect to the trailing-slash version, as shown here: imgur.com/5WAKiyM

So because my server does correctly serve 301s, either there is a side effect with PHP's curl, or perhaps some additional logic is needed to detect the 301 redirect and set the base path accordingly. It would be a nice improvement to SSEP to allow for omission of trailing slashes.

Sources:
Section 6.2.4 - tools.ietf.org/html/rfc3986#page-42
Section 3 - tools.ietf.org/html/rfc1808#section-3

Admin Posts: 805
Thanks to your information, I updated SSEP to check the URL the server redirects to.
Now it sets the base path correctly, and it works when trailing slashes are omitted.
- Download the new version (from: coursesweb.net/php-mysql/ssep-site-search-engine-php-ajax_s2 ), and either use it or just replace your 'ssep/php/crawlindex.php' file with the one from the new version.
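
For readers curious about the idea behind this kind of fix, here is a minimal sketch in Python. It is not the actual code in crawlindex.php, and the function names `effective_base` and `resolve_link` are hypothetical. It assumes the crawler knows the HTTP status and the `Location` header of the response, and uses the redirect target as the base for relative links. With PHP's cURL, the final URL after following redirects is available via `curl_getinfo()` with `CURLINFO_EFFECTIVE_URL`, though whether SSEP uses that is not stated in this thread.

```python
from urllib.parse import urljoin

# HTTP status codes that redirect to the URL in the Location header.
REDIRECT_CODES = {301, 302, 303, 307, 308}

def effective_base(requested_url, status, location=None):
    """Return the URL that relative links on the fetched page resolve against."""
    if status in REDIRECT_CODES and location:
        # Location may itself be relative, so resolve it against the request URL.
        return urljoin(requested_url, location)
    return requested_url

def resolve_link(requested_url, status, location, href):
    """Resolve a link found on the page, honouring a redirect if there was one."""
    return urljoin(effective_base(requested_url, status, location), href)

# Server redirects /news/memo to /news/memo/ with a 301.
print(resolve_link("https://mydomain.com/news/memo", 301,
                   "https://mydomain.com/news/memo/", "old_memos.php"))
```

Without the redirect information (for example a plain 200 response), the function falls back to the requested URL as the base, which is the earlier behaviour.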

mike406 Posts: 7
Awesome! That was fast, and it worked perfectly. Now I have a great index of my site. Thank you!