Greetings,

We're trying to use the Generator on our WordPress site, which has around 25k WordPress pages and around 150k total pages, but after a day of running it has only crawled 88k pages.

This seems to be way too much time to effectively use this Generator in our AWS environment. Why is it taking so long, and how can we get it to run faster?

I don't see how this could be used properly to keep an active daily sitemap if every time the cron triggers the script to run it takes 2-3 days to complete.

I also see that our sitemap seems to include URLs that aren't actually pages visible to visitors of our site, such as "wp-json/oembed/1.0/embed?url=http%3A%2F%2Fwww.site.com%2F&format=xml", which leads me to believe this is indexing way more pages than we want, and that this is one of the reasons it's taking so much time.
Re: How long should it take to generate a site with 150k pages?
« Reply #1 on: May 04, 2018, 04:51:15 AM »
Hello,

1. The crawling time depends mainly on how long the website takes to generate each page, since the generator crawls the site much like a search engine bot does.
For instance, if it takes 1 second to retrieve every page, then 1000 pages will be crawled in about 17 minutes (1000 seconds).

In this case I would recommend running the generator from the command line if you have SSH access to your server.

2. You need to add "oembed" to the Exclude URLs setting to avoid crawling those pages.
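
To put numbers on the estimate in point 1, here is a rough Python sketch. It assumes pages are fetched strictly one at a time and that the per-page time stays constant, which is a simplification. The function name `estimate_crawl_hours` is only illustrative; the figures (88k pages in roughly a day, 150k pages total) are the ones reported in this thread.

```python
def estimate_crawl_hours(pages: int, seconds_per_page: float) -> float:
    """Sequential crawl time in hours, assuming one page is fetched at a time."""
    return pages * seconds_per_page / 3600


# Per-page time implied by "88k pages after a day of running".
observed_rate = 24 * 3600 / 88_000

print(f"{observed_rate:.2f} s/page")                                        # 0.98 s/page
print(f"{estimate_crawl_hours(1000, 1.0) * 60:.1f} min for 1000 pages")     # 16.7 min
print(f"{estimate_crawl_hours(150_000, observed_rate):.1f} h for 150k pages")  # 40.9 h
```

At the observed rate a full crawl of 150k pages would take a little under two days, so cutting the number of crawled URLs or the page generation time are the two levers this estimate points to.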
« Reply #2 on: May 04, 2018, 07:18:23 AM »
1. Are there any instructions for running this over SSH? Would we still use all the settings from the /generator website and just execute it from SSH?

2. Since we have a WordPress site, we have the following Exclude URLs that we don't want to be pulled:
wp-includes/
xmlrpc.php
wp-json/
/2018/
/2017/
/2016/
/2015/

So is just adding "oembed" to the bottom of this list the correct way, instead of "wp-json/" or "wp-json/oembed" as we have it now? Should we have the "/" at the start of each exclude line?

3. We also wanted to ensure that the priority of our site went in order of homepage > categories > individual pages.

Since all of our individual pages don't have a slug that can link them, I figured we would just turn off Automatic Priority and set the default to 0.8.

Then in the Individual Attributes we have the following format:
[ External links are visible to forum administrators only ]$,,daily,1.0
[ External links are visible to forum administrators only ]$,,daily,0.9
[ External links are visible to forum administrators only ]$,,daily,0.9

But this doesn't seem to be working, and if we try to reduce the depth to speed up the crawl it doesn't pull these individual pages. I figured that since we manually added these URLs and assigned them the highest priority, it would force them to be pulled first. Is the above format correct?
« Reply #3 on: May 05, 2018, 05:28:50 AM »
Hello,

1. The command line to use via SSH is displayed on the Crawling page of the sitemap generator. The same settings will be used.

2. If you want to exclude other URLs from crawling, you need to add them to the Exclude URLs setting too. The leading slash character is not needed.

3. This setting affects only the "priority" attribute in the created sitemap. It doesn't change the way the website is crawled.
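
For point 2, here is a minimal Python sketch of how such an exclusion list could behave. It assumes each Exclude URLs entry is treated as a plain substring of the URL; that is consistent with the advice that no leading slash is needed, but it is an assumption and not confirmed behavior of the generator. The helper `is_excluded` and the second sample URL are hypothetical.

```python
from typing import Iterable


def is_excluded(url: str, exclude_fragments: Iterable[str]) -> bool:
    """Return True if any exclusion fragment occurs anywhere in the URL.

    Assumption: plain substring matching, no wildcards or regular expressions.
    """
    return any(fragment in url for fragment in exclude_fragments)


# Entries taken from the list posted earlier in this thread, plus "oembed".
exclude = ["wp-includes/", "xmlrpc.php", "wp-json/", "/2018/", "oembed"]

oembed_url = "http://www.site.com/wp-json/oembed/1.0/embed?url=http%3A%2F%2Fwww.site.com%2F&format=xml"
normal_url = "http://www.site.com/some-category/some-page/"  # hypothetical page

print(is_excluded(oembed_url, exclude))  # True
print(is_excluded(normal_url, exclude))  # False
```

Under this assumption either "wp-json/" or "oembed" alone would already match the oembed URL, so if such URLs still appear it may be worth checking that the saved settings are the ones the crawl actually used.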
« Reply #4 on: April 11, 2019, 04:34:27 AM »
Sorry, but this is exactly what I'm facing:

wp-json/oembed/1.0/embed?url=
Exclude URLs:
xmlrpc
oembed
?p=
wlwmanifest.xml
wp-includes
wp-json

It still adds the oembed URLs.
« Reply #6 on: April 11, 2019, 05:20:17 AM »
?p= is fine, it doesn't show up; only embed does. I'm going to add a filter to disable it, I guess.