Google Sitemaps II

Okay, so the gsitecrawler program did the trick. It was able to crawl all 22,702 links on the utahlinux.com site and create a valid sitemap. The uncompressed sitemap is about 5 MB, so I uploaded the compressed version (sitemap.xml.gz) and submitted it to Google. After several minutes Google validated it, and now comes the wait for it to be properly indexed. Hopefully I'll have more original content on the site by then.
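For anyone curious what actually gets uploaded: a sitemap is just an XML list of <url> entries, and Google accepts it gzip-compressed. Here's a rough Python sketch of the format and the compression step (this isn't what gsitecrawler does internally, and the URLs are made up):

import gzip
from xml.sax.saxutils import escape

# A couple of made-up URLs; a crawler like gsitecrawler discovers these for you.
urls = [
    "http://www.utahlinux.com/",
    "http://www.utahlinux.com/docs/index.htm",
]

# Build a minimal urlset in the Sitemaps XML format.
entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in urls)
sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    f"{entries}\n"
    "</urlset>\n"
)

# Write the plain file plus the gzip-compressed sitemap.xml.gz,
# which is the version worth uploading when the plain file is large.
with open("sitemap.xml", "w") as f:
    f.write(sitemap)
with gzip.open("sitemap.xml.gz", "wt") as f:
    f.write(sitemap)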

Google Sitemaps

So, I'm starting up a new website, www.utahlinux.com. I just set up Google Analytics and started working on a Google Sitemap for the site. Even though Google says that having a Google sitemap doesn't affect the ranking of your site, I think it helps a lot, especially with getting the site indexed by the search engine. We'll see how long it takes to get indexed. (I submitted a sitemap for this site, utahsysadmin.com, yesterday, so we have a little contest now between the two of them.)

I was disappointed to see that the Google SiteMap Generator that Google offers for creating sitemaps does not include a crawler for following all the links on your site and indexing them for you. Maybe it would give away too much about how their spiders work if they did that. Since my utahlinux.com site has a lot of pages (mainly resource and informational documents mirrored for local use), I had to get a crawler to automate the creation of the sitemap. I tried using phpSitemapNG for quite some time before I decided that the site was just too big for the script. It's a PHP-based program, runs pretty slow, and eventually times out before ever finishing the mapping. So I downloaded one that runs on your PC instead of your server: gsitecrawler. So far it seems to be running well and has a bunch more options. Hopefully it saves its progress, so that if it stalls out, it can start where it stopped previously. We'll see…
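For what it's worth, the basic idea behind any of these sitemap crawlers is pretty simple: fetch a page, collect the same-site links, and repeat until there's nothing new. Here's a rough sketch of that approach in Python (not how gsitecrawler or phpSitemapNG actually work; the start URL and the page limit are just placeholders):

import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse

START = "http://www.utahlinux.com/"   # placeholder start page
LIMIT = 100                           # placeholder cap on pages to visit

class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start, limit):
    """Breadth-first crawl of same-host pages; returns the URLs found."""
    host = urlparse(start).netloc
    seen, queue = {start}, [start]
    while queue and len(seen) < limit:
        url = queue.pop(0)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                if "text/html" not in resp.headers.get("Content-Type", ""):
                    continue
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that fail to load
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urldefrag(urljoin(url, href))[0]
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return sorted(seen)

if __name__ == "__main__":
    for u in crawl(START, LIMIT):
        print(u)

A real tool would also save its progress (the visited set and the queue) to disk as it goes, which is exactly what would let it pick up where it left off after a stall.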

mod_rewrite

So I set this blog up to use permalinks, and it automatically gave me the content for my .htaccess file to use mod_rewrite to handle the links. That was pretty cool. It reminded me of a site I recently worked on where I ended up using mod_rewrite to hide all the ? and & symbols in the URLs. One thing I was unsure of was how to create an exception to the rule, for example for CSS files that didn't need to be rewritten. The answer was RewriteCond. Here's an example of what I used in my Apache config:

<IfModule mod_rewrite.c>
    RewriteEngine on
    # Exclude the css directory from the rewrite
    RewriteCond %{REQUEST_URI} !^/css
    # Send "static" pages through site.cgi
    RewriteRule ^/([a-z]+)\.htm$ /site.cgi?page=$1
</IfModule>

Of course, the site.cgi script checks whether or not the file actually exists locally, to prevent remote page calls (remote file inclusion) that can compromise your server very quickly. :)
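The site.cgi script itself isn't shown here, but the idea is easy to sketch: only accept a plain page name that matches the rewrite pattern, and only serve it if the file actually exists in a local directory. Something along these lines (a hypothetical CGI handler in Python, with a made-up pages directory, not the actual script):

#!/usr/bin/env python3
import os
import re
import sys
from urllib.parse import parse_qs

PAGES_DIR = "/var/www/pages"   # made-up location of the local .htm files

def main():
    query = parse_qs(os.environ.get("QUERY_STRING", ""))
    page = query.get("page", [""])[0]
    # Only plain lowercase names are accepted, matching the [a-z]+ in the
    # RewriteRule; this rules out URLs, "..", slashes, and the like.
    if not re.fullmatch(r"[a-z]+", page):
        print("Status: 400 Bad Request\nContent-Type: text/html\n")
        print("<h1>Bad request</h1>")
        return
    path = os.path.join(PAGES_DIR, page + ".htm")
    # Serve the page only if it actually exists locally, so the script can
    # never be tricked into pulling in a remote page.
    if not os.path.isfile(path):
        print("Status: 404 Not Found\nContent-Type: text/html\n")
        print("<h1>Not found</h1>")
        return
    print("Content-Type: text/html\n")
    with open(path) as f:
        sys.stdout.write(f.read())

if __name__ == "__main__":
    main()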