Tuesday, September 5, 2017

Remove duplicate content - Session ID pages

By Adrián Coutin

Duplicate content and SEO have a bad relationship: the more duplicate content from a website sits in Google's index, the worse that site will rank. In many cases, however, duplicate content is created unintentionally.


Duplicate content: Session ID pages 


The following example stems from a poor configuration of the robots.txt file that allowed Googlebot to access product pages with session IDs in their URLs.

Example of session ID URLs indexed by Google (source: Google Search Console)
In this case, each session ID URL duplicates the content of the original page. Any spider crawling these pages will find the same content, the product page itself, repeated across different URLs.

How does Googlebot find session ID URLs if they are not exposed in the navigation system or any other internal linking scheme?

Googlebot discovers pages from many sources of backlinks. If, for example, a user shares a product link on a social network and that URL carries the user's session ID, Googlebot will crawl the page.

How to Remove Duplicate Content Created by Session ID Pages?


Google Search Console - Google Index


According to Google Search Console, this website has more than 6,000 pages in Google's index. Nevertheless, the site actually exposes more than 12,000 indexable pages.

Screaming Frog, with the right configuration, can crawl all of these pages; Google, however, for some reason (probably duplicate content or crawl budget) limits the number of pages it includes in its main index.

After a redesign on a new CMS, with a new robots.txt configuration blocking access to session ID pages, Google Search Console reported a high number of 404 errors, mostly connected with session ID pages.

Google Search Console - Not found pages

On the website this post focuses on, a migration process was implemented: new URLs, redirects, etc. helped to remove session ID pages faster via the 404 status code. On a website that keeps these kinds of pages in Google's index, however, the cleanup can take longer.

The following approach can be applied to websites that are not going through a migration.

1. First, always update robots.txt to block session ID pages. Example of a line to add to robots.txt:

Disallow: /*?SID=

This line will block all URLs carrying the session ID parameter, such as:

https://mysiteurl/contact/?SID=sfl317buq8ru4uf4a...
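For reference, a minimal robots.txt applying this rule to every crawler could look like the sketch below. This assumes the parameter is named SID, as on this site; adjust the pattern if your platform uses a different name.

User-agent: *
Disallow: /*?SID=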

2. Googlebot will stop indexing new session ID pages; however, a high number of the existing ones will remain in Google's index for a long time.

3. Webmasters, SEO specialists, and developers can use several options to reduce the number of these pages in Google's index as soon as possible:

- Disable session IDs in URLs via the CMS or PHP code, so the website stops including this parameter in URLs (see the PHP sketch after this list).

- Configure .htaccess so that all session ID pages return a 410 status code (the requested resource is no longer available on the server and no forwarding address is known); see the .htaccess sketch below.

The 410 approach has the advantage of reducing the number of session ID pages, and therefore the duplicate content, faster. However, it can affect functionality on e-commerce websites (e.g. a multi-store configuration).
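For the first option, if the site runs on plain PHP sessions, a minimal sketch is to tell PHP to rely on cookies only. These are standard php.ini settings; a CMS or e-commerce platform may expose its own configuration switch for this instead.

<?php
// Never accept a session ID passed in the URL; rely on the cookie only.
ini_set('session.use_only_cookies', '1');
// Never rewrite links to append the session ID as a URL parameter.
ini_set('session.use_trans_sid', '0');
session_start();

For the second option, a minimal .htaccess sketch, assuming Apache with mod_rewrite enabled, answers 410 Gone for any URL that carries the SID parameter:

<IfModule mod_rewrite.c>
RewriteEngine On
# Match SID= at the start of the query string or after another parameter
RewriteCond %{QUERY_STRING} (^|&)SID= [NC]
# The G flag returns "410 Gone" and stops further rewriting
RewriteRule ^ - [G]
</IfModule>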
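Test the rule on a few session ID URLs (for example with curl -I) before deploying it site-wide, to confirm that normal pages without the SID parameter still return 200.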

If it is not possible to remove the session ID from URLs, these pages will simply remain in Google's index for a longer time.


