We converted a Drupal site and used Elm for the front-end. However, as far as we can tell Google does not see our site as an SPA, so it only sees and indexes the index.html.
We have provided a sitemap but that didn’t help (we had this in place at launch). And from poking around in the Google Search Console it simply appears Google only sees the index.html content, which contains nothing. It does not launch a browser and wait for the content to be rendered.
We have decided to prerender our pages, as Google clearly does not recognise our SPA. We had done a ton of research beforehand, and from what we had seen Google does handle SPAs. But clearly it does not.
Is there any help the Elm community can give to see why Google does not recognise our SPA?
I’m not sure if this is the problem in this case, but the normal Googlebot and the Chromium-based Googlebot that runs JS are different, and the Chromium one used to crawl with a delay compared to the non-JS bot.
If you transitioned from the old site to the new one, then maybe the Chromium Googlebot still hasn’t had time to crawl the site with JS?
There may also be a wide variety of other problems with the site; I’d start by looking into something like these:
The site has been up for months, and pages are frequently crawled. But Google just sees the index.html; it definitely does not fire up any kind of real browser.
Running Lighthouse, I see that it often returns the error “robots.txt is not valid. Lighthouse was unable to download a robots.txt file”. I see that robots.txt is there, but I wonder why Lighthouse sometimes fails to fetch it.
Also the sitemap.xml file is very large. I don’t know if this can be an issue too.
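For reference, a robots.txt only needs a couple of lines to allow everything and to point at the sitemap; something like this (the URL is just a placeholder), served as plain text with a 200 status:

```
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml
```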
What is Google Search Console telling you? Does it read the 7710 entries you have in the sitemap?
It may not be Elm, but these are actually 3 sites, none of them indexed. And I have a fourth, completely unrelated site I built 2 years ago that also never got properly indexed. As it wasn’t important for that site, I hadn’t investigated it.
I have no clue what’s going on, and either way, people should be quite aware of the issue.
Clicking on “repeat the search with the omitted results included”, it seems that Google indexed 42,700 results from the website, but all of them have the same title and no description, so Google consolidates them into 6 pages.
I am not an SEO expert, but I would suggest changing, for each page, both the <title> and the <meta> description, using an Elm port, so that the Google bot can differentiate the pages when indexing. It is also a good way to influence how the Google search result looks. This should be done as quickly as possible, to be safe, since I see that the real content of the page currently arrives with some delay.
Also, multiple URLs should not have the same content; where they do, it is better to use the “canonical” link tag, because Google doesn’t like duplicated content.
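I am not sure how your <head> is wired up, but as a rough sketch (module and port names here are made up): the <title> can already be set per page from the record that Browser.application’s view returns, and the description and canonical values can be pushed out through a single port whose JS subscriber writes into document.head.

```elm
port module SeoPorts exposing (setPageMeta)

-- Rough sketch: the JS side subscribes to `setMeta` and updates
-- <meta name="description"> and <link rel="canonical"> in <head>.

import Json.Encode as Encode


port setMeta : Encode.Value -> Cmd msg


-- Call this from `update` whenever the route changes.
setPageMeta : { description : String, canonicalUrl : String } -> Cmd msg
setPageMeta page =
    setMeta
        (Encode.object
            [ ( "description", Encode.string page.description )
            , ( "canonical", Encode.string page.canonicalUrl )
            ]
        )
```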
If possible it would also be beneficial to reduce the number of assets for each page. At the moment a typical page seems to have around 113 requests, 1.5 MB transferred, 6.8 MB of resources, finish at 7.4 s, load at 3.18 s. I also noticed several calls to the same script, https://platform.twitter.com/widgets.js; I wonder if this can be improved too.
The node path isn’t used any more; those are old Drupal paths, and they were indeed reindexed with no title. We have fixed that, but it doesn’t appear to have done anything. You will notice that the current state has proper titles, canonical tags, etc., and even for reindexed pages, when we look at what Google says it has cached/loaded, it’s just the index.html.
Thanks for pointing out the performance. We measured that without Twitter, but Google sees the Twitter widget, and that indeed seems to hurt performance badly. I’ll see if I can avoid loading that for the Google bot.
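The rough plan (untested sketch, all names made up) is to let the init JS pass a flag such as isBot: /bot|crawl|spider/i.test(navigator.userAgent) and simply never render the node that pulls in widgets.js when it is set:

```elm
module Main exposing (main)

-- Untested sketch. The JS that calls Elm.Main.init would pass
--   { isBot: /bot|crawl|spider/i.test(navigator.userAgent) }
-- as flags; when it is true we never render the node that
-- pulls in platform.twitter.com/widgets.js.

import Browser
import Html exposing (Html, div, text)


type alias Flags =
    { isBot : Bool }


type alias Model =
    { isBot : Bool }


init : Flags -> ( Model, Cmd msg )
init flags =
    ( { isBot = flags.isBot }, Cmd.none )


viewTweetEmbed : Model -> Html msg
viewTweetEmbed model =
    if model.isBot then
        text ""

    else
        -- placeholder for the real embed markup
        div [] [ text "tweet embed here" ]


view : Model -> Html msg
view model =
    div [] [ viewTweetEmbed model ]


main : Program Flags Model ()
main =
    Browser.element
        { init = init
        , update = \_ model -> ( model, Cmd.none )
        , view = view
        , subscriptions = \_ -> Sub.none
        }
```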
My “webserver” CouchDB added a CSP response header after an update, and Chromium did not render my SPA anymore, while Firefox did. It may not be related, but as Google uses Chrome for executing JS, it may be worth checking.
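I don’t know exactly which header CouchDB sends, but as an illustration: a strict policy like the first header below makes Chromium block inline <script> tags, so an inline Elm.Main.init(...) call never runs, while the second one allows it (moving the init call into the bundled .js file is the cleaner fix):

```
Content-Security-Policy: default-src 'self'

Content-Security-Policy: default-src 'self'; script-src 'self' 'unsafe-inline'
```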
Unfortunately I cannot offer you a solution; I just want to tell you that I manage an Elm SPA e-commerce site which Google has no problem indexing.
All SPAs suffer from this problem. What I recommend is elm-pages, which will produce a static site that Google can crawl. You can rehydrate your model and enjoy all the SPA goodness.
That is not true; we are running multiple SPAs that are successfully indexed by Google, all built in Elm.
Some years ago Google was using Chrome 41 to render, and that caused problems if your site didn’t work in old browsers. Since 2019 Google has used the latest Chromium to render, and since then we haven’t had any problems with our sites.
In this case I would guess there is some error preventing Google from rendering the site properly. There is one error reported in the JavaScript console; it might be worth fixing.
Would you mind checking the following in Google Search Console and posting some screenshots?
Also check the Page Resources section; sometimes the API the actual content is fetched from is blocked or has problems (for example CORS or timeouts).
Thank you for sharing this. I’m definitely going to test it. Are you rendering your pages with all the usual on-page SEO? Are you having to do anything different?
I don’t think we are doing anything unusual to make our SPAs crawlable nowadays; here is what we usually think about:
Make sure links are rendered as <a href="..."> elements; don’t update the URL with onClick events on links you want Google to find (see the sketch after this list).
Make sure navigation menu links are in the document even when the menu is closed/not visible. Google renders JavaScript, but it does not “interact” with the app (hover, click, etc.).
Include the <meta name="robots" content="all"> tag as well as <title> and <meta name="description" content=".."> on all pages.
Register your site in Search Console and submit a sitemap.xml.
Make sure the page can be rendered reasonably fast. Google has some kind of rendering budget for each site/page and we have noticed problems when the page is too slow, for example when content is fetched through a long chain of API requests. @lucamug made a nice writeup of some of his findings regarding this.
We have noticed that SPA sites without server-side rendering take more time to get indexed, somewhere in the ballpark of a few weeks for crawling a sitemap.xml with 1k URLs. If content changes fast and organic traffic is an important source, you might look into something like Rendertron; we use it on some sites and it makes a difference in the number of crawled pages per day.
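To make the first two points a bit more concrete, here is a stripped-down sketch of what we mean (routes and labels are placeholders): with Browser.application the menu is rendered as plain <a href="..."> elements that are always in the DOM, and clicks still stay inside the SPA via onUrlRequest.

```elm
module Main exposing (main)

import Browser
import Browser.Navigation as Nav
import Html exposing (a, li, nav, text, ul)
import Html.Attributes exposing (href)
import Url exposing (Url)


type alias Model =
    { key : Nav.Key }


type Msg
    = LinkClicked Browser.UrlRequest
    | UrlChanged Url


init : () -> Url -> Nav.Key -> ( Model, Cmd Msg )
init _ _ key =
    ( { key = key }, Cmd.none )


update : Msg -> Model -> ( Model, Cmd Msg )
update msg model =
    case msg of
        LinkClicked (Browser.Internal url) ->
            -- stays inside the SPA, but the markup is still a plain <a href="...">
            ( model, Nav.pushUrl model.key (Url.toString url) )

        LinkClicked (Browser.External externalUrl) ->
            ( model, Nav.load externalUrl )

        UrlChanged _ ->
            -- route handling omitted in this sketch
            ( model, Cmd.none )


view : Model -> Browser.Document Msg
view _ =
    { title = "Example page title"
    , body =
        [ -- keep menu links in the DOM even when the menu is visually collapsed
          nav []
            [ ul []
                [ li [] [ a [ href "/" ] [ text "Home" ] ]
                , li [] [ a [ href "/products" ] [ text "Products" ] ]
                ]
            ]
        ]
    }


main : Program () Model Msg
main =
    Browser.application
        { init = init
        , view = view
        , update = update
        , subscriptions = \_ -> Sub.none
        , onUrlRequest = LinkClicked
        , onUrlChange = UrlChanged
        }
```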
I went through some of this, so I can offer my perspective:
I rebuilt, in Elm, an e-commerce site that had previously been a mishmash of WordPress and PHP nonsense. When I initially built the site, for some reason I was having trouble getting routing to work, so my application used # anchors for routing - i.e., the “inventory” page was at https://mysite.com/#inventory
Google did not index the site, and we disappeared from the Internet. (This was an unpopular outcome, and made a lot of people very unhappy.)
The initiative to fix this involved the following actions:
I created and submitted a sitemap.xml, and made sure to refresh it when the contents of my site updated
I stopped using # anchors and figured out how to do regular path-based routing (rough sketch after this list)
I changed all of our URL paths to be more descriptive - for instance, https://mysite.com/inventory/18507 became https://mysite.com/inventory/18507-allen-bradley-relay-model-2
I built a “sitemap” into the website - I made sure that it was possible to access each inventory listing from a link on the main page by traversing through the different organizational units of the site (i.e., category → brand → make → model → inventory detail page) - from what I’ve read, Google really likes this, nearly as much as a sitemap
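For anyone curious, the routing side of this ended up looking roughly like the sketch below (made-up names, not my actual code): the numeric id plus the descriptive slug lives in the path, and Url.Parser pulls the route out of it.

```elm
module Route exposing (Route(..), fromUrl)

-- Sketch of path-based routes instead of # anchors, with the numeric id
-- plus a descriptive slug in the URL. Names are made up for illustration.

import Url exposing (Url)
import Url.Parser as Parser exposing ((</>), Parser, map, oneOf, s, string, top)


type Route
    = Home
    | Inventory
    | InventoryDetail String -- e.g. "18507-allen-bradley-relay-model-2"


parser : Parser (Route -> a) a
parser =
    oneOf
        [ map Home top
        , map Inventory (s "inventory")
        , map InventoryDetail (s "inventory" </> string)
        ]


fromUrl : Url -> Maybe Route
fromUrl url =
    Parser.parse parser url
```

With Browser.application you would call fromUrl in init and again on every UrlChanged message.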
It took a few months, but over time Google gradually indexed more and more of our pages. I don’t use server-side rendering, or really anything at all very fancy - just a fairly static site hosted on nginx that loads my Elm JS from an index.html page.
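Probably obvious, but once the # anchors are gone the server has to hand back index.html for every application path, otherwise deep links 404 for users and Googlebot alike. The relevant nginx bit looks something like this (a sketch, not my exact config; paths are placeholders):

```nginx
server {
    listen 80;
    server_name mysite.com;
    root /var/www/mysite;

    location / {
        # serve real files (elm.js, images, ...) directly,
        # fall back to the SPA shell for application routes
        try_files $uri $uri/ /index.html;
    }
}
```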
I don’t know how much of this applies to you, directly or otherwise, but in my case - it worked, gloriously, and led to a significant increase in business for our company.
Thank you so much for your screenshots; they made this very easy to follow.
Crawling: Googlebot smartphone (exactly the same line as you have).
JavaScript errors with the live test: none for our site, but we do get some for the Twitter widget on this page, such as:
Error
Access to fetch at 'https://syndication.twitter.com/settings?session_id=2b6844d02721225f514074239168e8be1b0618a2'
from origin 'https://platform.twitter.com' has been blocked by CORS policy:
No 'Access-Control-Allow-Origin' header is present on the requested resource.
If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.
I don’t think that is relevant. Or is it?
We had some discussion here about whether we could disable the Twitter bits when we detect a bot is running, but we have the concern that Google might not like us doing that, i.e. we would present a different page to the bot than to the world, and Google might penalise us for that.
Anyway, with the live URL we get the proper page (we had already checked this initially), but the indexed HTML is simply our index.html, with no JavaScript loaded.