Elm SPA caused all our pages to be removed from Google Search

We converted a Drupal site and used Elm for the front-end. However, as far as we can tell Google does not treat our site as an SPA, so it only sees and indexes the index.html.

After a while all our pages were removed from Google Search, and traffic has dropped by 90%. See here what Google returns as indexed.

We have provided a sitemap but that didn’t help (we had this in place at launch). And from poking around in the Google Search Console it simply appears Google only sees the index.html content, which contains nothing. It does not launch a browser and wait for the content to be rendered.

We have decided to prerender our pages, as clearly Google does not recognise SPAs. We had done a ton of research, and from what we had seen Google does handle SPAs. But clearly it does not.

Is there any help the Elm community can give to see why Google does not recognise our SPA?

1 Like

I’m not sure if this is the problem in this case, but the normal Googlebot and the Chromium-based Googlebot that runs JS are different, and the Chromium one used to run with a delay compared to the non-JS bot.

If you transitioned from the old site to the new one, then maybe the Chromium Googlebot still hasn’t had time to crawl the site with JS?

There may also be a wide variety of problems with the site; I’d start by looking into something like these:

Google does index JS websites; see for example package.elm-lang.org.

The site has been up for months, and pages are frequently crawled, but Google just sees the index.html. It definitely does not fire up any kind of real browser.

1 Like

If you want to go down the pre-rendering approach, then it might be worth looking at either elm-starter (https://github.com/lucamug/elm-starter, an Elm-based bootstrapper for Elm applications) or https://elm-pages.com/, though I don’t know which would work better for your situation.

1 Like

Usually Google has no problem indexing SPAs (see a Google search for site:https://package.elm-lang.org, for example).

Running Lighthouse, I see that it often returns the error “robots.txt is not valid. Lighthouse was unable to download a robots.txt file”. I see that robots.txt is there, but I wonder why Lighthouse fails sometimes.

Also, the sitemap.xml file is very large. I don’t know whether this could be an issue too.

What is Google Search Console telling you? Does it read the 7,710 entries you have in the sitemap?

In any case, this seems related to the concept of an SPA per se. I don’t think Elm itself is the cause, despite what the title suggests.

3 Likes

You’re sure package.elm-lang.org does not use server side rendering?

Sitemap is fine.

It may not be Elm, but these are actually 3 sites, none of them indexed. And I have a fourth, completely unrelated site I built 2 years ago that also never got properly indexed. As it wasn’t important for that site, I hadn’t investigated it.

I have no clue what’s going on, and either way, people should be quite aware of the issue.

2 Likes

(package.elm-lang.org does not use server side rendering; it’s plain old Elm!)

6 Likes

Clicking on “repeat the search with the omitted results included”, it seems that Google indexed 42,700 results from the website, but all of them have the same title and no description, so Google consolidates them into 6 pages.

I am not an SEO expert, but I would suggest changing, for each page, both the <title> and the <meta> description, using an Elm port (see the sketch below), so that the Google bot can differentiate the pages when indexing. It is a good way to influence how the Google search results look. This should be done as quickly as possible, to be safe. I also notice that the real content of the page arrives with some delay.

Also, multiple URLs should not have the same content; otherwise it is better to use the canonical link tag (<link rel="canonical">). Google doesn’t like duplicate content.
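
As an illustration, a minimal sketch of such a port; the module, port and field names are made up, and the JavaScript side is shown in the comment:

    port module Seo exposing (setMeta)

    -- Hypothetical port: the Elm app sends the new title, description and
    -- canonical URL on every page change; a small JavaScript subscriber
    -- applies them to the document head.

    port setMeta :
        { title : String, description : String, canonical : String }
        -> Cmd msg

    {- On the JavaScript side, assuming the tags already exist in index.html:

       app.ports.setMeta.subscribe(function (meta) {
         document.title = meta.title;
         document
           .querySelector('meta[name="description"]')
           .setAttribute("content", meta.description);
         document
           .querySelector('link[rel="canonical"]')
           .setAttribute("href", meta.canonical);
       });
    -}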

If possible, it would also be beneficial to reduce the number of assets for each page. A typical page currently seems to require around 113 requests, 1.5 MB transferred, 6.8 MB of resources, finish at 7.4 s, load at 3.18 s. I noticed several calls to the same script https://platform.twitter.com/widgets.js; I wonder whether this could also be improved.

9 Likes

I haven’t tried it yet, but Netlify also has a prerendering feature now.

The node path isn’t used any more; those are old Drupal paths, and they were indeed indexed with no title. We have fixed that, but it doesn’t appear to have made a difference. You will notice that the current pages have proper titles, canonical tags, etc., and even for reindexed pages, when we look at what Google says it has cached/loaded, it’s just the index.html.

1 Like

Thanks for pointing out the performance. We had measured it without Twitter, but Google sees the Twitter widget, and that indeed seems to hurt performance badly. I’ll see if I can avoid loading it for the Google bot.
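
Roughly what we have in mind is something like the sketch below; the isBot flag and the module are hypothetical, with the flag set in index.html from a user-agent check and passed in as a flag:

    module TweetEmbed exposing (view)

    -- Sketch only: `isBot` is a hypothetical flag, set in index.html with
    -- something like
    --   Elm.Main.init({ flags: { isBot: /Googlebot/i.test(navigator.userAgent) } })
    -- The idea is to swap the heavy widgets.js embed for an equivalent plain
    -- link when a crawler is detected, so the visible content stays the same.

    import Html exposing (Html, a, blockquote, text)
    import Html.Attributes exposing (class, href)


    view : { isBot : Bool } -> String -> Html msg
    view flags tweetUrl =
        if flags.isBot then
            -- Same content, no third-party script.
            a [ href tweetUrl ] [ text "View this tweet on Twitter" ]

        else
            -- Normal embed: widgets.js picks up this blockquote.
            blockquote [ class "twitter-tweet" ]
                [ a [ href tweetUrl ] [ text tweetUrl ] ]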

1 Like

Do you use a CSP header?

My “web server”, CouchDB, added a CSP response header after an update, and Chromium did not render my SPA anymore, while Firefox did. It may not be related, but as Google uses Chrome to execute JS, it may be worth checking.

1 Like

Unfortunately, I cannot offer you a solution; I just want to tell you that I manage an Elm SPA e-commerce site which Google has no problem indexing.

6 Likes

No, we don’t have any CSP header.

All SPAs suffer from this problem. What I recommend is elm-pages, which will produce a static site that Google can crawl. You can rehydrate your model and enjoy all the SPA goodness.

3 Likes

That is not true; we are running multiple SPAs that are successfully indexed by Google, all built in Elm.
Some years ago Google used Chrome 41 for rendering, which caused problems if your site didn’t work in old browsers. Since 2019 Google renders with the latest Chromium, and since then we haven’t had any problems with our sites.

In this case I would guess there is some error preventing Google from rendering the site properly. There is one error reported in the JavaScript console; it might be worth fixing.

Would you mind checking the following in Google Search Console and posting some screenshots?

  1. Which crawler is used? (Check settings.)

  2. If you do a live test, do you get any JavaScript errors or other suspicious things?

JavaScript errors should be displayed in the live test results.

Also check the Page Resources section; sometimes the API from which the actual content is fetched is blocked or has problems (for example CORS errors or timeouts).

9 Likes

Thank you for sharing this. I’m definitely going to test this. Are you rendering your pages with all the usual on-page SEO? Are you having to do anything different?

I don’t think we are doing anything unusual to make our SPAs crawlable nowadays; here is what we usually think about:

  • Make sure links are rendered as <a href="..."> elements; don’t update the URL with onClick events on links you want Google to find (see the sketch after this list).

  • Make sure navigation menu links are in the document even if they are closed/not visible. Google renders JavaScript, but it does not “interact” with the app (hover, click, etc.).

  • Include the <meta name="robots" content="all"> tag as well as <title> and <meta name="description" content=".."> on all pages.

  • Register your site in Search Console and submit a sitemap.xml.

  • Make sure the page can be rendered reasonably fast. Google has some kind of rendering budget for each site/page and we have noticed problems when the page is too slow, for example when content is fetched through a long chain of API requests. @lucamug made a nice writeup of some of his findings regarding this.

  • We have noticed that SPA sites without server-side rendering take more time to get indexed, somewhere in the ballpark of a few weeks for crawling a sitemap.xml with 1k URLs. If content changes fast and organic traffic is an important source, you might look into something like Rendertron; we use that on some sites and it makes a difference in the number of crawled pages per day.
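
To illustrate the first point, here is a minimal, self-contained sketch (not from any of our production sites; the routes and titles are made up) of how we keep links crawlable: the markup stays plain <a href="..."> elements, clicks are intercepted through Browser.application’s onUrlRequest, and each page gets its own <title> from the Document returned by view.

    module Main exposing (main)

    import Browser
    import Browser.Navigation as Nav
    import Html exposing (a, div, text)
    import Html.Attributes exposing (href)
    import Url exposing (Url)


    type alias Model =
        { key : Nav.Key, url : Url }


    type Msg
        = LinkClicked Browser.UrlRequest
        | UrlChanged Url


    main : Program () Model Msg
    main =
        Browser.application
            { init = \_ url key -> ( { key = key, url = url }, Cmd.none )
            , view = view
            , update = update
            , subscriptions = \_ -> Sub.none
            , onUrlRequest = LinkClicked
            , onUrlChange = UrlChanged
            }


    update : Msg -> Model -> ( Model, Cmd Msg )
    update msg model =
        case msg of
            LinkClicked (Browser.Internal url) ->
                -- Internal clicks become pushUrl, but the href stays in the markup.
                ( model, Nav.pushUrl model.key (Url.toString url) )

            LinkClicked (Browser.External externalUrl) ->
                ( model, Nav.load externalUrl )

            UrlChanged url ->
                ( { model | url = url }, Cmd.none )


    view : Model -> Browser.Document Msg
    view model =
        { title = "example.com" ++ model.url.path -- a distinct <title> per page
        , body =
            [ div []
                -- Plain hrefs that the crawler can follow without clicking anything.
                [ a [ href "/inventory" ] [ text "Inventory" ]
                , a [ href "/about" ] [ text "About" ]
                ]
            ]
        }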

5 Likes

I went through some of this, so I can offer my perspective:

I rebuilt, in Elm, an e-commerce site that had previously been a mishmash of WordPress and PHP nonsense. When I initially built the site, for some reason I was having trouble getting routing to work, so my application used # anchors for routing, i.e. the “inventory” page was at https://mysite.com/#inventory

Google did not index the site, and we disappeared from the Internet. (This was an unpopular outcome, and made a lot of people very unhappy.)

The effort to fix this involved the following actions:

  • I created and submitted a sitemap.xml, and made sure to refresh it when the contents of my site updated
  • I stopped using # anchors and figured out how to do regular path-based routing (see the sketch after this list)
  • I changed all of our URL paths to be more descriptive - for instance, https://mysite.com/inventory/18507 became https://mysite.com/inventory/18507-allen-bradley-relay-model-2
  • I built a “sitemap” into the website - I made sure that it was possible to access each inventory listing from a link on the main page by traversing through the different organizational units of the site (i.e., category → brand → make → model → inventory detail page) - from what I’ve read, Google really likes this, nearly as much as a sitemap
  • I added a robots.txt and filled it out
  • I added RDFa tags to take advantage of Google Rich Results: https://search.google.com/test/rich-results
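
For the routing change, the shape of it ended up roughly like the sketch below (simplified; the route names are illustrative), using Url.Parser from elm/url:

    module Route exposing (Route(..), fromUrl)

    import Url exposing (Url)
    import Url.Parser as Parser exposing ((</>), Parser, map, oneOf, s, string, top)


    type Route
        = Home
        | Inventory
        | InventoryItem String -- e.g. "18507-allen-bradley-relay-model-2"


    parser : Parser (Route -> a) a
    parser =
        oneOf
            [ map Home top
            , map Inventory (s "inventory")
            , map InventoryItem (s "inventory" </> string)
            ]


    fromUrl : Url -> Maybe Route
    fromUrl url =
        -- Real paths like /inventory/18507-... parse directly; no # fragment
        -- is involved, so each page has a distinct, crawlable URL.
        Parser.parse parser url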

It took a few months, but over time Google gradually indexed more and more of our pages. I don’t use server-side rendering, or really anything at all very fancy - just a fairly static site hosted on nginx that loads my Elm JS from an index.html page.

I don’t know how much of this applies to you, directly or otherwise, but in my case - it worked, gloriously, and led to a significant increase in business for our company.

5 Likes

Thank you so much for your screenshots, made it very easy to follow.

  1. Crawling: Googlebot smartphone (exactly the same as you have).

  2. JavaScript errors with live test: none for our site, but we get some for the Twitter widget on this page, such as:

Error

Access to fetch at 'https://syndication.twitter.com/settings?session_id=2b6844d02721225f514074239168e8be1b0618a2' 
from origin 'https://platform.twitter.com' has been blocked by CORS policy: 
No 'Access-Control-Allow-Origin' header is present on the requested resource. 
If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.

I don’t think that is relevant. Or is it?

We had some discussion here about whether we could disable the Twitter bits when we detect a bot, but we are concerned that Google might not like us doing that, i.e. presenting a different page to the bot than to the world, and might penalise us for it.

Anyway, with the live URL test we get the proper page (we had already checked this initially), but the indexed HTML is simply our index.html, with no JavaScript loaded.

1 Like