Duplicate content on websites has a negative impact on the ranking in search engines like Google. But there are cases in which you simply cannot avoid it due to technical needs. In this article we will show a solution and some ideas on how specific problems can be solved using FirstSpirit just by adding some lines of template code.
Do you have duplicate content in your web projects? Most well-organized projects will not show any signs of duplication at first. What makes it difficult is the way search engines work: they index content per link/URL. As a result, even content that is reused in a different part of the site is considered a duplicate. A search engine does not care how often a piece of content was created in FirstSpirit. For a search engine it only matters how many different URLs provide the same information.
Roughly speaking, whenever two different URLs end up delivering the same (or nearly the same) content, this is considered "duplicate content". Examples could be:
Landing pages: /fs5/ (as landing page) and /marketing/campaigns/2013/FirstSpirit5.html (the "real" URL) deliver exactly the same content (using FirstSpirit Short URLs or Apache rewrite rules and not 301 redirects).
Sortable elements (leaving pagination aside for a moment) have the same content, just in a different order.
International websites: When serving different regions in the same language with only slight differences between the pages, a search engine may also consider them duplicate content. Those small differences may be due only to a country-specific marketing focus or to special legal requirements in that country and are not significant enough to create a "different" page.
Intentional duplicate content: In some cases, identical content items have to be provided in different parts of a website: The press release of a subdivision will also appear in the corporate media center.
A search engine will not be able to identify the "real", "preferred" or - the technical term - "canonical" one. So it cannot consolidate URL properties. In consequence, the rank of an individual content item is spread over many URLs instead of being concentrated on what the content author considers the "best hit". The result is many medium or even low page ranks. What you want is something like a "concentrated" popularity reflecting the real value of the content. This would lead to - of course - one single but much higher page rank.
General solution in HTML
First of all: Try to avoid duplicate content by always directing users to the same URL. As the examples above show, however, you may find yourself in a situation where this is not possible. For these cases there is a solution: Using the <link rel="canonical"> element (defined in RFC 6596) in the <head> area of a page, you can tell search engines that the current page is not the "master page" and direct them to the real content source:
<link rel="canonical" href="/path/origin.html" />
Google explains the problem and usage (how and when) of that element in more detail here: https://support.google.com/webmasters/answer/139394 (make sure you watch the video!). Keep in mind that the canonical element is always just a hint for a search engine. How it eventually affects the way search results are presented and ranked is still up to the search engine's operator. Furthermore, the exact details are still subject to change, so keep yourself updated!
So now you know what your code has to look like in the end. But how do you implement these suggestions when dealing with FirstSpirit projects? As there are far too many project-specific variations to cover in this article, let's just focus on three examples:
Landing pages (using the Short URL feature of FirstSpirit)
Multiple page references (sitestore) to one content page (pagestore)
Manual linking of canonical pages
For all examples we will use the "Advanced URLs" option in the generation tasks, as it enables the advanced URL mechanisms - including the use of stored URLs.
Example 1: Landing pages using Short URLs
In FirstSpirit 5, you can define additional "Short URLs" in Global settings => Short URLs (don’t confuse it with the "SEO URLs" feature which changes paths!).
What are Short URLs in FirstSpirit?
Maybe not everybody is familiar with this feature of FirstSpirit 5 yet, so we will explain it in detail. As a short example, we are using a site in German and English with some news provided by a datasource, filtered by categories using a query on the page references (much like the Mithras Energy product categories overview pages). The site has the following structure:
Now we want to create additional Short URLs for some overview pages.
Using this feature, FirstSpirit will create additional files for those page references in the specified path(s) upon generation as you can see from the directory tree below.
This is not the same as just manually creating another page reference to the same content page. These additional files behave as if they were located in the same position as the original page reference. So that additional file is more of an "optical copy" of the "real page": The page looks exactly the same for the visitor, including for example navigation and values of sitestore variables. In the screenshot below, you see that - even if the generated file is located in /Travel/index.html - FirstSpirit generated the content as if it was the "real" page located elsewhere in the structure.
The origin page looks exactly the same - and there is no additional template syntax needed. Just FirstSpirit magic.
This could not have been achieved by using additional page references. They would show the navigation according to their position in the sitestore tree, using the sitestore variables and metadata defined for that level, simply because that behaviour is one of the main features of additional page references: reusing pages in different structural contexts.
Note that the additional Short URL files are not just file copies. If you take a look at the code you will see that the generated relative paths to other FirstSpirit elements (internal links to pages and media files) do reflect the different position in the file system.
Creating the canonical element
So how do we create a "canonical link" from the "copy" to the "original" now? An input component, maybe in the metadata? No, not for Short URLs. There is no object you could link from - FirstSpirit creates those extra files on the fly upon generation and there is no counterpart anywhere in the sitestore or pagestore. The solution is much simpler here: Why not request the canonical URL from FirstSpirit directly? By inserting just three lines of code between <head> and </head> in your page template, you are done:
This generates the corresponding canonical link element in the files /Travel/index.html and /Travel-News/index.html.
There is one thing to keep in mind: When creating the canonical URL, FirstSpirit will generate an absolute link (starting with "/") even if your project is configured to use relative ones. So if your deployed site is not located directly beneath the document root, you have to use the setting "prefix for absolute paths" in the generation task.
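The effect of such a prefix can be illustrated outside of FirstSpirit. The following Python sketch is not FirstSpirit template code; the function name canonical_link and the example prefix /mysite/ are assumptions made purely for illustration. It only shows how an absolute canonical path and a deployment prefix combine into the final link element.

```python
from html import escape


def canonical_link(canonical_path: str, prefix: str = "") -> str:
    """Render a canonical link element for an absolute path.

    `prefix` plays the role of the "prefix for absolute paths" setting.
    Both names are hypothetical and not part of any FirstSpirit API.
    """
    if not canonical_path.startswith("/"):
        raise ValueError("canonical path must be absolute (start with '/')")
    # Drop a trailing slash on the prefix so the join never yields '//'.
    href = prefix.rstrip("/") + canonical_path
    return '<link rel="canonical" href="{}" />'.format(escape(href, quote=True))


# Site deployed directly beneath the document root: no prefix needed.
print(canonical_link("/path/origin.html"))
# Site deployed in a subdirectory: the prefix is prepended.
print(canonical_link("/path/origin.html", prefix="/mysite/"))
```

Without the prefix, the second site would point search engines to a path outside its own subdirectory.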
Example 2: Multiple page references
If you have more than one page reference to a content page, FirstSpirit will create one file for each of them. To create the canonical link elements, we need to:
Get the PageRef candidates to this content page using #global.page.incomingReferences and filtering the results to get a list of candidate PageRefs.
Decide which one should be the “main” page reference, i.e. the canonical one.
Sounds quite simple, but we have to consider some details.
Step one: Get a candidate list
First we have to define how to deal with content projections. Normally they use different filters in queries so that they do not create duplicate content. If you look at the example above (news categories) there are multiple page references but they are all creating different content. So here we will just ignore content projections (by checking whether the page reference has Content2Params).
Secondly, the API method Page#getIncomingReferences() (defined in the StoreElement interface) returns all references to the page. Depending on your project there may be other references to a page than just PageRefs. And even if they are PageRefs, maybe the content page is not referenced as the page to be displayed by that PageRef but somehow else (e.g. in the metadata). Those cases are rare but if you are using those mechanisms, we have to filter them out, too.
Step two: Choose the “right” candidate
The next step is deciding which PageRef should be the canonical one. We could try it the easy way and just take the first one we find using the filtered results of Page#getIncomingReferences(), but this would have a great disadvantage in the SEO context: There is no guarantee that the order of the incoming references will remain the same over time - especially if additional page references are added later. Thus the evaluated canonical page could change. So we have to look for an alternative.
Why not just consider the "oldest" page reference the canonical one? In most cases this should fit your needs because the oldest PageRef will most likely be the first one that showed up on your site and thus the first one indexed by a search engine. We could call this the "oldest PageRef is canonical" approach. An easy way to get the oldest PageRef is to just use the one with the lowest ID because FirstSpirit assigns IDs in ascending order upon object creation.
Our objective is to get all incoming references for the current Page object which are page references and are using the current Page object as content page and take the one with the smallest ID.
We will use mapping expressions (lambdas), which are far more straightforward for this kind of task (see ODFS: Mapping expressions) - in this case we chain three of them:
The first lambda (.map) "converts" a list of ReferenceEntries (returned by the .getIncomingReferences method) to a list of IDProviders, the second one filters that list according to our criteria (defined in step one). So after applying the .filter() function we have a list of candidates - step one: done.
Performing step two is surprisingly simple thanks to lambdas: The .min() function retrieves a "minimal element" from a list, our candidate list in this case. When using .min() you can define the minimum criteria yourself by using any expression that returns a "Comparable". For each element in the list that expression will be evaluated and the element that yields the lowest result is returned. As stated above, we simply use the ID here.
So the result of the lambda chain is exactly the PageRef we want. After a simple check if we really have a result and if we are not currently generating the canonical page itself, the link element is rendered.
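To make the logic of the chain easier to follow, here is a minimal sketch of the same map/filter/min idea in plain Python. It is not FirstSpirit template syntax: the classes Page, PageRef, Media and ReferenceEntry as well as their fields are hypothetical stand-ins for the FirstSpirit objects, reduced to what the selection needs.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Page:
    id: int


@dataclass(frozen=True)
class PageRef:
    id: int
    page: Page                        # content page displayed by this reference
    has_content_params: bool = False  # True for content projections


@dataclass(frozen=True)
class Media:
    id: int


@dataclass(frozen=True)
class ReferenceEntry:
    referencing_element: object       # the element that holds the reference


def canonical_page_ref(page: Page, incoming: Iterable[ReferenceEntry]) -> Optional[PageRef]:
    # map: reference entries -> the elements that reference the page
    elements = map(lambda entry: entry.referencing_element, incoming)
    # filter: keep page references that display this page and are no projections
    candidates = filter(
        lambda el: isinstance(el, PageRef)
        and el.page == page
        and not el.has_content_params,
        elements,
    )
    # min: "oldest PageRef is canonical", i.e. the lowest ID wins
    return min(candidates, key=lambda ref: ref.id, default=None)


page = Page(10)
incoming = [
    ReferenceEntry(PageRef(205, page)),
    ReferenceEntry(Media(3)),                                    # not a PageRef
    ReferenceEntry(PageRef(101, page, has_content_params=True)), # projection
    ReferenceEntry(PageRef(150, page)),
    ReferenceEntry(PageRef(120, Page(11))),                      # displays another page
]
canonical = canonical_page_ref(page, incoming)
print(canonical.id if canonical else None)
```

Because the result depends only on the IDs, it stays the same no matter in which order the incoming references are returned.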
Example 3: Manual definition of the canonical page reference
If you want more control over the generated canonical element, you could of course implement a version that uses manual settings defined by an editor. One could for example define a FS_REFERENCE in the metadata of the page that links to the "canonical pageref".
A good solution would be to give editors the possibility to define a canonical page reference manually and use the automatic version from example 2 as a fallback.
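The decision logic of such a combined approach could look roughly like the following Python sketch. It is only an illustration under assumptions: the function names and the use of plain integer IDs are hypothetical, and a real implementation would read the manual choice from the metadata input component and the candidates from the filtered incoming references.

```python
from typing import Iterable, Optional


def choose_canonical(manual_ref_id: Optional[int],
                     candidate_ids: Iterable[int]) -> Optional[int]:
    """Prefer the editor's manual choice, else fall back to the lowest ID."""
    if manual_ref_id is not None:
        return manual_ref_id
    # Fallback: the "oldest PageRef is canonical" approach from example 2.
    return min(candidate_ids, default=None)


def should_render_link(canonical_id: Optional[int], current_id: int) -> bool:
    """Render the link element only on pages that are not canonical themselves."""
    return canonical_id is not None and canonical_id != current_id


# An editor selected page reference 205 manually: it overrides the fallback.
print(choose_canonical(205, [150, 205]))
# No manual selection: the candidate with the lowest ID is used.
print(choose_canonical(None, [150, 205]))
```

Keeping the fallback in place means pages without a manual setting still get a stable canonical element.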
How are you dealing with duplicate content for search engines? Let us know in the comments for an even bigger picture on the topic!