Is duplicate content penalising your site?

Multilingual SEO Strategy Guide

You have products on an e-commerce site and some content on your website or blog, and you have decided to translate it to make your site multilingual and engage a specific international market.

But suddenly, you have a doubt: what if my translated content registers as duplicate content? Will Google penalise me for it?

For those of you in a rush, the answer is no: if you have a site and have translated the content for several countries, this shouldn’t be a problem.

But there is a “but”: there are a few things you should know so that possible duplicate content doesn’t harm the search ranking of your multilingual site.

What is duplicate content?

Let’s imagine that you manage an e-commerce site selling Bluetooth smartwatches, and that your CMS (Content Management System), such as WordPress or Joomla, lets you use filters to change the order in which products are displayed.

This may mean that every time you choose a display criterion, the corresponding URL also changes automatically.

The result? You end up with multiple pages, each with its own URL and a different content distribution, but in terms of the actual content displayed, they are extremely similar. Pages with duplicate content.

According to a 2015 study by Raven Tools, this problem is widespread. The study estimated that a full 29% of sites scanned by Googlebot (the automatic tool that scans the web to index content) feature duplicate content.

Therefore, when we create content we need to make sure we let Google know which pages to show the user (a potential lead), so we can turn visits into conversions. In fact, the entire point of Google is to make navigation satisfying for users and to avoid showing the same content over and over in the same search.

This is Google’s own definition:

Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar.

Basically, it’s an indexing matter on Google’s side, but one you should still know how to manage.

Let’s clarify something here: unknowingly having duplicate content won’t get you penalised, but it won’t help your ranking either.

We say this because there are some practices that Google considers deceptive: e-commerce or site managers often publish duplicate content on purpose across multiple domains to manipulate their search engine rankings and increase traffic.

You need to be careful with this, because in these cases you run the risk of being penalised or even having your site removed from search results.

Why avoid duplicate content: an SEO perspective

As Moz suggests in this article on duplicate content, it’s always a good idea to tell the search engine which version of your site to index and position in order for it to understand whether to direct link metrics (presence, anchor text, link profile analysis) to a single page or to keep them separate between several versions.

If you don’t give this the proper attention, duplicate content could cause you to lose rankings and visibility. This is because every website that links back to yours has to choose which duplicate to link to, so your backlinks end up split between several versions instead of strengthening a single page.

How to improve existing content on your site

What elements of your multilingual site do you need to take care of to avoid creating duplicate content?

1. URL structure

Domains have a major effect on SEO. Uniform Resource Locators (URLs) are the first elements to stand out in Google Analytics reports as immediate references to content. They are the first places where search terms are optimised, and the first to give an idea of where content is located within the information tree.

A well-structured URL must have:

  • a hierarchical structure: domain/parent-page/child-page
  • lowercase letters
  • words separated by a dash
  • brevity, conciseness, and absence of spaces and non-ASCII characters
  • search keywords but not too many, so as to be clear to both users and search engines.
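As a rough illustration of the checklist above, here is a minimal Python sketch that turns page titles into URL paths following those rules. The function names `slugify` and `build_path` and the sample titles are our own assumptions, not part of any CMS; most CMSs have their own slug generator that does something similar.

```python
import re
import unicodedata


def slugify(text: str) -> str:
    """Turn a page title into a lowercase, dash-separated, ASCII-only slug."""
    # Split accented letters into base letter + accent, then drop what is not ASCII.
    # Letters with no ASCII decomposition (e.g. "ß") are simply removed.
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace every run of spaces or punctuation with a single dash.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_text.lower())
    return slug.strip("-")


def build_path(*segments: str) -> str:
    """Join slugs into a hierarchical path: /parent-page/child-page/."""
    return "/" + "/".join(slugify(segment) for segment in segments) + "/"


# Hypothetical category and product titles.
print(build_path("Smartwatch Bluetooth", "Modèle Sport 2.0"))
# prints /smartwatch-bluetooth/modele-sport-2-0/
```

The sketch does not decide which keywords to include or how many: that remains an editorial choice.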

2. The content’s translation

Today “content is king”, but this “king” has to be original, authoritative and customised.

Customising content matters because users want to read content localised for their market or, at the very least, written in their language.

Each translated version will need not only well-localised content, i.e. content that has been properly adapted to the target audience, but also its own customer journey map and its own keyword study, which in turn lead to a different content architecture, different texts and a different optimisation of SEO elements.

As we were saying, translations are not duplicate content, but you do need to be careful about how they are done. If the translation is made using software, webmaster tools, or even Google Translate, and is not revised, the quality won’t be the best.

Translations done by computer (automatic translation) often do not come out naturally. They are easily identified by their lack of personal touch and sometimes even classified as spam.

To prevent this from happening, the best solution is to hire a translation agency with professional native translators to ensure your visitors a better experience and content delivery.

To make sure visitors always find different and updated content on the various pages that make up your site, you can reinforce your SEO by rewording sentences. You will find further ideas in our article about which keywords to use in the titles and descriptions that appear on the SERP.

Bad practices that generate duplicate content

1. Untranslated content on localised domains

Let’s imagine you have created several localised domains to target more international markets with your smartwatch e-commerce site, such as a .it domain for Italy and a .de domain for Germany.

If you haven’t translated and localised your various content, the search engine will find it duplicated on all domains. And, though Google knows where users are typing from and which is the correct version to show them based on the domain and their country of reference, without translated content, you risk Google failing in its attempt.

So, translation and professional localisation of content remain the keys to providing a good user experience and to proving to Google that you have contextualised and reshaped content that, in this way, is authentic.

2. Content extracted from other sites (content scraping)

Google’s Panda algorithm doesn’t like scraped content. Content created through web scraping is produced by an automated process that extracts data from a website using software programs which mimic human navigation.

This practice is common among e-commerce companies that sell multiple versions of the same product and that often reproduce product descriptions verbatim from other online sites, usually the manufacturer’s, without adding to or changing the content of the descriptions.

Again, a translation, localisation and content writing agency can take care of rewriting this content so that it is not considered duplicate.

3. Content republished on other sites (content syndication)

Another pitfall is when content republished on other sites creates duplicate content.

To get around this problem, ask the site that is disseminating your content to create a backlink to your site with an appropriate anchor text.

Alternatively, republished content can be marked with the rel="canonical" link tag. This will be discussed further on, but essentially it tells search engines which URL, among several versions of the same content, to consider “canonical”, that is, the main one. Yet another alternative which may work for you is the noindex meta tag, which we also explain further on.

In any case, for some extra certainty, know that Google does not treat content republished on LinkedIn or Medium as duplicate content. We recommend, however, that you wait at least 7 days before republishing it, so that Googlebot has time to index the original content on your site first. Alternatively, go ahead and publish to these platforms first and implement rel="canonical" pointing to the version on your blog.

How to fix duplicate content

What technical solutions can you adopt in a duplicate content situation on your multilingual site?

1. Inserting the rel="canonical" tag into the source code

The solution to clarify the relationship between pages with different URLs that are very similar or almost identical, and to handle the phenomenon of duplicate content is the rel="canonical" attribute. Canonical URLs, which are very useful for our smartwatch e-commerce, tell Google which version to consider canonical (the main one): this means that all the SEO data generated by the other duplicate versions will be channelled toward this version and that this will be the one shown in SERPs.

It is good SEO practice to insert the canonical link tag into the HTML file header, within the <head> tag of the main version: the canonical link tag may, in fact, be self-referential:

<html>
<head>
<link rel="canonical" href="https://www.bluetoothsmartwatch.it"/>
</head>
</html>

and it should be inserted into the HTML files of duplicate versions in the same way.

Be careful, though, because as Semrush states in its article, a canonical tag is only a hint you give to Google, not a directive: Google may still choose a different URL as canonical.
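To tie this back to the smartwatch shop whose filters generate a new URL for every display order, here is a minimal Python sketch of how a template could build the canonical tag for such pages. The parameter names in `DISPLAY_ONLY_PARAMS`, the function names and the sample URLs are assumptions made for illustration; your CMS will use its own parameters.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical query parameters that only change how products are displayed.
DISPLAY_ONLY_PARAMS = {"sort", "order", "view"}


def canonical_url(url: str) -> str:
    """Drop display-only parameters so every filtered variant maps to one URL."""
    parts = urlsplit(url)
    kept = [(key, value) for key, value in parse_qsl(parts.query)
            if key not in DISPLAY_ONLY_PARAMS]
    # Rebuild the URL without the fragment and with the remaining parameters only.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_tag(url: str) -> str:
    """Return the <link> element to place inside the page's <head>."""
    return f'<link rel="canonical" href="{escape(canonical_url(url), quote=True)}"/>'


print(canonical_link_tag(
    "https://www.bluetoothsmartwatch.it/smartwatch/?sort=price&order=asc"))
```

With this approach the filtered page and the unfiltered page both declare the same canonical URL, which is the self-referential case described above for the main version.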

2. HTTP or HTTPS? With or without www? Trailing slash or not?

Canonical tag links represent just one of the tools in your repertoire.

Needless to say, URL consistency is key.

  • Just remember, having two active HTTP and HTTPS versions on your site that have identical content and are visible on search engines is enough to count as duplicate content, even if unintentional, as it is in most cases.

In the case of e-commerce, choose HTTPS as the preferred version of your domain: on the one hand, it will reassure users that your site is secure, especially when sensitive information has to be provided and saved; on the other hand, Google favours HTTPS and may rank it better.

When creating a domain (with or without WWW), it’s best to choose a preferred version: this decision tells the search engine which domain to scan and index, which will lead to better results.

The same could be said for versions with or without a final slash (trailing slash): by convention, a trailing slash at the end of a URL indicates a directory, whereas a URL without a trailing slash indicates a specific file. Again, you will need to choose a preferred version.

To avoid generating duplicate content in the three cases we have looked at, the most appropriate solution is a permanent 301 redirect. This means that the URL of the duplicate content will redirect to the one you have chosen as your preferred version, so that the ranking, traffic and tracking of the old URL are not lost. This can be done by editing the .htaccess file:

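Redirect 301 / https://www.bluetoothsmartwatch.en/

Note that the first argument of Redirect is a URL path, not a domain name. The rule above sends every path to the preferred version, so it belongs only in the .htaccess file of the non-preferred version: placed on the preferred version itself, it would redirect in a loop.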

If you want a more immediate solution and manage your website on WordPress, you can resort to the All In One Redirection plugin.
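Whichever mechanism performs the redirect, the rule behind the three cases above is the same: every variant of a URL must map to exactly one preferred form. The Python sketch below only illustrates that rule; it is not a replacement for a server-side redirect. It assumes a preferred version with HTTPS, www and a trailing slash on directory-like URLs, and the function names are our own.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: every URL passed in belongs to this site, so the host is simply replaced.
PREFERRED_HOST = "www.bluetoothsmartwatch.en"


def preferred_url(url: str) -> str:
    """Map any variant of a URL to the https + www + trailing-slash form."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Treat the last path segment as a file only if it contains a dot (page.html).
    last_segment = path.rsplit("/", 1)[-1]
    if not path.endswith("/") and "." not in last_segment:
        path += "/"
    return urlunsplit(("https", PREFERRED_HOST, path, parts.query, ""))


def redirect_for(url: str):
    """Return (301, target) when the URL is not the preferred version, else None."""
    target = preferred_url(url)
    return None if url == target else (301, target)


for variant in ("http://bluetoothsmartwatch.en/smartwatch",
                "https://www.bluetoothsmartwatch.en/smartwatch/"):
    print(variant, "->", redirect_for(variant))
```

Running a list of your site’s URLs through a check like this is a quick way to spot variants that still answer without redirecting.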

3. The NOINDEX meta tag attribute

Another option when you have two pages with similar content, such as a regular page and its print version, is to insert into the source code of the duplicate page a <meta> tag with the name robots and the content noindex:

<meta name="robots" content="noindex">

This tells the search engine bot (also called crawler or spider) not to index that page.

4. HREFLANG tags for managing localised sites

When we have a multilingual site and want to reach users who live in different countries and speak different languages, we need to insert the hreflang and rel="alternate" attributes in the source code. They signal to Googlebot that the same content has been translated and is addressed to different geographical areas and different languages and is, therefore, not duplicated.

For example, the different versions (Italian for Italy and German for Germany) of our smartwatch e-commerce site will have the following strings in their respective <head> sections:

<link rel="alternate" hreflang="it-it" href="https://www.bluetoothsmartwatch.it/">
<link rel="alternate" hreflang="de-de" href="https://www.bluetoothsmartwatch.de/">

There are two important aspects to pay attention to here when using the HREFLANG attribute:

  1. language codes must be expressed in ISO 639-1 format, and country codes in ISO 3166-1 Alpha 2 format.
    It’s important to remember that you can specify the language without specifying the country, but you can’t do the opposite: Google does not automatically deduce the exact language from the country code used. What’s more, the country code always comes after the language code;

  2. return links must always be present: once the attribute has been inserted, if page A links to page B, page B must link back to page A too, as the HREFLANG annotation may otherwise be ignored or misinterpreted by the search engine.
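Both rules above lend themselves to an automated check. The following Python sketch assumes you have already collected each page’s hreflang annotations into a dictionary; the data structure and the function names are hypothetical. The pattern only checks the shape of a code (two letters, optionally followed by a dash and two more letters), not whether the letters are real ISO codes.

```python
import re

# Shape check only: "it", "it-it" and "de-DE" pass; "de_de" or "italian" do not.
HREFLANG_PATTERN = re.compile(r"^[a-z]{2}(-[a-z]{2})?$", re.IGNORECASE)


def invalid_codes(annotations):
    """annotations maps page URL -> {hreflang code: alternate URL}."""
    return sorted({code
                   for alternates in annotations.values()
                   for code in alternates
                   if not HREFLANG_PATTERN.match(code)})


def missing_return_links(annotations):
    """List (page, alternate) pairs where the alternate does not link back."""
    missing = []
    for page, alternates in annotations.items():
        for target in alternates.values():
            if target == page:
                continue  # a page may list itself; nothing to check
            back_links = annotations.get(target, {}).values()
            if page not in back_links:
                missing.append((page, target))
    return missing


IT = "https://www.bluetoothsmartwatch.it/"
DE = "https://www.bluetoothsmartwatch.de/"
annotations = {
    IT: {"it-it": IT, "de-de": DE},
    DE: {"de_de": DE},  # wrong separator, and no return link to the Italian page
}
print(invalid_codes(annotations))
print(missing_return_links(annotations))
```

In this example the German page is reported twice: once for the underscore in its code and once for the missing link back to the Italian page.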

Before we reach our conclusion, we want to leave you with some tools you can use together to uncover any duplicate content:

  • one simple method is to paste an entire portion of text into Google, as opposed to the usual keyword or keywords;
  • Copyscape;
  • Copyscape’s Compare function, which compares two URLs;
  • Siteliner, a free tool that allows you to find duplicate content within a site by simply entering the URL.
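If you already have the text of two pages to hand, a rough comparison is also easy to script. The sketch below uses Python’s standard difflib module to measure how much two texts overlap, word by word. It is only an approximation for your own checks: it is not how Google or the tools above measure similarity, and the product descriptions are invented for the example.

```python
from difflib import SequenceMatcher


def similarity(text_a: str, text_b: str) -> float:
    """Return a ratio between 0 and 1 of how much two texts overlap, word by word."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    # autojunk=False keeps very frequent words in the comparison on long texts.
    return SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()


# Hypothetical product descriptions: the second one was copied and lightly edited.
manufacturer = "Bluetooth smartwatch with heart rate monitor and seven day battery"
shop_page = "Bluetooth smartwatch with heart rate monitor and ten day battery"

print(f"{similarity(manufacturer, shop_page):.2f}")
```

A ratio close to 1 suggests the two texts are near-duplicates and that one of them is worth rewriting; where to set the threshold is a judgment call.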

Conclusions

  • Opening up to foreign markets through a multilingual site requires strategic, well thought out localisation and content writing, so that you avoid creating duplicate content, consciously or not.
  • You need to pay attention not only to the content translation process, but also to other, more technical elements that can prevent your multilingual site from being properly indexed.
  • Providing a valid user experience involves a mix of linguistic and technical aspects which we need to be aware of when managing an e-commerce site or online business.

Glossary

anchor text
the visible, clickable text of a link.
backlink
a link from another website that leads to yours, increasing your visibility.
conversion
the behaviour or reaction of a user induced to perform a desired action by the website manager, e.g. product purchase, newsletter subscription.
Google Panda
a Google filter algorithm concerned with content quality, which penalises content that is uninformative, empty and superficial. It was released in April 2011 in Europe and was updated regularly; the last announced update was in July 2015.
lead
potential contact or buyer interested in the product and/or service sold on the website.
SERP (Search Engine Results Page)
the page of results generated when a user searches for one or more keywords.

Further reading

If you need a translator for your multilingual site, contact us for a non-binding quote.

Translator, UX designer, content writer. Degree in Languages and Master’s in Localisation and New Technologies. She’s been working with Qabiria since 2018.
