Questions about indexing a Joomla based site with 500K + URLs
« on: September 24, 2015, 09:54:21 PM »
I got a copy of your script a few days ago, installed it on my site, and have run it at least 3 times since. I changed settings after each run to improve results, but I am still unable to index the whole site.
1st run -> 4K pages indexed
2nd run -> 27K pages indexed
3rd run -> 82K pages indexed

I'm running the script from the command line:
php /xxx/xxx/xxx/public_html/generator/runcrawl.php > /dev/null &
Here are my questions:

a) Why are some pages processed but not indexed?

b) Where is the log file located?

c) How do I resume script execution from the command line? I have set the "Monitor crawler window and automatically resume if it stops in X seconds:" configuration value to 30 secs, but I'm unsure whether the script is resuming automatically.

d) Can I pass some of the settings as parameters on the command line, e.g. Starting URL, Save Sitemap to and Your Sitemap URL?

e) It looks like the script indexed 30 pages per minute (1 every 2 secs), or 1,800 per hour. Is there any way to speed that up?

Thanks a lot for your help,

Re: Questions about indexing a Joomla based site with 500K + URLs
« Reply #1 on: September 25, 2015, 06:08:41 AM »
Hello,

1. Some pages have a blocking robots meta tag or are redirected to another URL.

2. The dump with crawling session data is located in the generator/data/crawl_dump.log file. The debug log is sent to console output when running in command line (you are redirecting it to /dev/null).

3. Each time you run it in command line, the generator resumes the previous session if one exists.

4. No.

5. The crawling time depends mainly on the website's page generation time, since the script crawls the site much as search engine bots do.
For instance, if it takes 1 second to retrieve every page, then 1000 pages will be crawled in about 16 minutes.
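
To put rough numbers on point 5, here is a small Python sketch. It is not part of the generator; the function names are made up for illustration, and it assumes pages are fetched strictly one at a time, so total time is simply page count times average fetch time.

```python
def estimate_crawl_seconds(pages: int, seconds_per_page: float) -> float:
    """Estimate total crawl time, assuming a strictly sequential crawl."""
    return pages * seconds_per_page


def format_duration(seconds: float) -> str:
    """Render a number of seconds as days, hours and whole minutes."""
    days, rem = divmod(int(round(seconds)), 86400)
    hours, rem = divmod(rem, 3600)
    minutes = rem // 60
    return f"{days}d {hours}h {minutes}m"


if __name__ == "__main__":
    # The example from the reply: 1000 pages at 1 second each.
    print(format_duration(estimate_crawl_seconds(1000, 1.0)))      # 0d 0h 16m
    # The rate reported in the question: 1 page every 2 seconds, 500K pages.
    print(format_duration(estimate_crawl_seconds(500_000, 2.0)))   # 11d 13h 46m
```

Under that assumption, a 500K-page site at 2 seconds per page would take well over a week of continuous crawling, so reducing page generation time is what moves the total.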
Re: Questions about indexing a Joomla based site with 500K + URLs
« Reply #2 on: October 06, 2015, 04:24:42 PM »
Hi, thanks for your answers.

Could you please take a look at my sitemap generator configuration? If so, please let me know how to share my login info with you.

My site has almost 1 million pages; the script processed 206,141 but indexed only 90,473.
I looked for the generator/data/crawl_dump.log file you mentioned, but no such file exists.

I need to know why these pages are not indexed. As far as I know, nothing is blocking them, such as a robots.txt rule or anything else.

thanks,
M

Re: Questions about indexing a Joomla based site with 500K + URLs
« Reply #3 on: October 07, 2015, 06:01:16 AM »
Hello,

Details on skipped pages should be displayed on the changelog details page. Please send me your generator URL and login by private message so I can check this.
Re: Questions about indexing a Joomla based site with 500K + URLs
« Reply #4 on: October 07, 2015, 01:15:30 PM »
Login info sent by PM. Thanks for reviewing our configuration.

Re: details on skipped pages,
the changelog shows only 1,000 skipped pages,
while it shows 206,141 processed pages but only 90,473 indexed.
I need to find out why it is not indexing them all.
Also, as I said, the site has around 1 million pages, but only about 200K are being processed and 90K indexed.
Why is it not crawling them all? It is not even "processing" them, let alone "indexing" them.
Re: Questions about indexing a Joomla based site with 500K + URLs
« Reply #5 on: October 09, 2015, 06:37:21 AM »
Hello,

1. Many pages are skipped because of a blocking robots meta tag, for instance:
http://www.yourdomain.com/ncaab-news/081515-michigan-st-moving-pauga-as-part-of-reorganization
has this tag:
<meta name="robots" content="noindex"/>

2. Changelog details are limited to the first 1,000 entries so that the script consumes less memory.

3. Please PM me an example URL that is not included in the sitemap, and explain how it can be reached starting from the homepage.
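
If you want to check pages for the tag from point 1 yourself, a minimal Python sketch along these lines may help. It is only an illustration, not the generator's own logic: the class and function names are hypothetical, it inspects HTML you have already downloaded, and it treats "none" as blocking because that value is commonly documented as shorthand for "noindex, nofollow".

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are routed here as well.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def is_noindex(html: str) -> bool:
    """Return True if the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives or "none" in parser.directives


if __name__ == "__main__":
    sample = '<html><head><meta name="robots" content="noindex"/></head></html>'
    print(is_noindex(sample))  # True
```

Running this over the HTML of a few processed-but-not-indexed URLs should show quickly whether the template is emitting the tag on those pages.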
 
Re: Questions about indexing a Joomla based site with 500K + URLs
« Reply #6 on: October 09, 2015, 03:29:25 PM »
Thanks again for your great help,

re: # 1
I'll find out why that "noindex" tag is there

re: # 2
ok

re: # 3
I'll send you details now