Identifying Entities Is Not Enough (Part 2)
After learning that Google uses entities to decide which webpages to show in search results, I wrote a Python script for entity analysis; I explained the concept behind it in Part 1. The script uses the Python library TextBlob to count proper nouns and unique proper nouns, which are types of entities. It also calculates entity density and unique entity density, metrics I derived by dividing each entity count by the article’s word count. I then used the script to analyze blog posts on several challenger banks’ websites.
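The original script isn’t reproduced here, but a minimal sketch of the approach might look like the following. It assumes TextBlob’s default part-of-speech tagger and treats the Penn Treebank proper noun tags (NNP and NNPS) as entity mentions; the function names are my own, not from the actual script.

```python
PROPER_NOUN_TAGS = {"NNP", "NNPS"}  # Penn Treebank tags for proper nouns


def entity_metrics(tags, word_count):
    """Compute entity counts and densities from (word, tag) pairs.

    `tags` is a list of (word, POS-tag) tuples, such as TextBlob's
    `blob.tags` attribute returns.
    """
    entities = [word for word, tag in tags if tag in PROPER_NOUN_TAGS]
    unique = {word.lower() for word in entities}
    return {
        "entities": len(entities),
        "unique_entities": len(unique),
        "entity_density": len(entities) / word_count if word_count else 0.0,
        "unique_entity_density": len(unique) / word_count if word_count else 0.0,
    }


def analyze(text):
    """Tag `text` with TextBlob and compute the four metrics."""
    # Imported lazily so entity_metrics() can be used and tested
    # without TextBlob (and its corpora) installed.
    from textblob import TextBlob

    blob = TextBlob(text)
    return entity_metrics(blob.tags, len(blob.words))
```

Counting proper nouns is only a rough proxy for entity recognition; a system that links mentions to a knowledge base would be closer to what Google actually does.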
Setting Up the Entity Analysis Experiment
In this experiment, I analyzed the text of 12 blog posts from each of four challenger banks’ websites, 48 articles in total. The banks were Revolut, Starling Bank, Tandem Bank, and Atom Bank, all British. Challenger banks are more popular in Britain than elsewhere, so a comparison of British banks was the easiest to perform. Many of them have been operating for several years now and have large archives of blog posts on their websites.
So I copied the text of the 48 articles from the challenger banks’ blogs into the integrated development environment (IDE) I use for analysis, the Python IDE Spyder. The scripts I write can be packaged as stand-alone executable files, but those packages take up more disk space, so I usually don’t bother.
After collecting the statistics for each article, I recorded them in an OpenOffice spreadsheet. I also used the Ahrefs SEO toolbar to collect the keyphrase count and organic search traffic estimate for each article. Then I used the linear regression function LINEST to test whether the entity analysis metrics predicted the Ahrefs SEO metrics. The idea was that the R squared value would show how much of the variation in each SEO metric was explained by each entity-related metric.
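LINEST does the regression inside the spreadsheet, but for a single predictor the same R squared value is just the square of the Pearson correlation coefficient, which is easy to compute directly in Python. A minimal sketch; the column values below are invented for illustration, not my experimental data:

```python
def r_squared(x, y):
    """R^2 for a simple linear regression of y on x.

    With one predictor, R^2 equals the squared Pearson correlation,
    which matches what LINEST reports for a single-column regression.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov ** 2 / (var_x * var_y)


# Hypothetical columns: entity counts vs. Ahrefs keyphrase counts.
entities = [12, 30, 7, 22, 15]
keyphrases = [40, 55, 18, 60, 33]
print(round(r_squared(entities, keyphrases), 4))
```

With 48 articles, each metric column would simply be a list of 48 values.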
The Experiment Results
The entity mentions did not affect the keyphrase rankings as much as I expected. For the entity count, the R^2 value was 0.016, meaning the total number of entities per article explained only 1.6 percent of the variation in the number of keyphrases reported by Ahrefs. For unique entities, the value was 0.0095; for entity density, 0.0097; and for unique entity density, 0.0176.
There was a slightly closer relationship between entity mentions and search traffic. For the entity count, the R^2 value was 0.0316, indicating that the number of entities per article explained 3.16 percent of the variation in organic search traffic. But the values for the other metrics were very low: 0.0000 for unique entities, 0.0067 for unique entity density, and 0.0110 for entity density.
I used a script that identified proper nouns rather than an API from Google or Microsoft that returns entities matched against a knowledge base. This type of analysis might work better with an official Google API that can recognize other types of entities.
Additionally, the script couldn’t measure the importance or relevance of the entities being analyzed, and that factor could make a difference. Many of the challenger banks’ blogs mention small businesses that the banks helped during the pandemic, for example. But those businesses aren’t notable enough to have their own Wikipedia entries, and many of them are in sectors that aren’t directly related to banking either. So mentioning them might not help a bank rank for financial keywords, even if the articles are interesting to read and explain how the bank helps its customers. Again, it would be difficult to obtain relevance or notability data without another data source, although SEO tools might be useful in this area.
Keyphrases and backlinks are still important ranking factors. And there are other content ranking factors besides entities, including the credentials of the author and whether the article itself is well-written and grammatically correct. So I didn’t expect any type of entity analysis to report an R^2 value close to 1; a result like that would suggest that entities were the most important ranking factor. Even if I improve my entity analysis methods, the other factors will retain their importance.
Relationships Between Entities
Google isn’t just analyzing entities; it’s analyzing the relationships between them. These relationships are stored in databases as triples using the subject, predicate, object model. For example, in the sentence “The light turned red,” the light is the subject, turned is the predicate, and red is the object.
It would be possible to identify these kinds of relationships using TextBlob. The subject is likely to be a noun near the beginning of the sentence, the predicate contains a verb and may span multiple words, and the object is typically located near the end of the sentence. TextBlob can tag the parts of speech for the entire sentence, so simple positional heuristics like these could be applied to its output.
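A very naive sketch of that heuristic, operating on (word, tag) pairs like those TextBlob’s `blob.tags` returns. The function name and the rules are my own simplifications, not a proper relation extractor:

```python
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def naive_triple(tags):
    """Guess a (subject, predicate, object) triple from POS-tagged words.

    Subject: first noun before the first verb.
    Predicate: the first verb.
    Object: last noun or adjective after the verb.
    Returns None when any part is missing.
    """
    verb_idx = next((i for i, (_, t) in enumerate(tags) if t in VERB_TAGS), None)
    if verb_idx is None:
        return None
    subject = next((w for w, t in tags[:verb_idx] if t in NOUN_TAGS), None)
    objects = [w for w, t in tags[verb_idx + 1:]
               if t in NOUN_TAGS or t.startswith("JJ")]
    if subject is None or not objects:
        return None
    return (subject, tags[verb_idx][0], objects[-1])


# "The light turned red" tagged by hand for illustration:
tags = [("The", "DT"), ("light", "NN"), ("turned", "VBD"), ("red", "JJ")]
print(naive_triple(tags))  # ('light', 'turned', 'red')
```

Real extraction would need noun phrase chunking, multi-word predicates, and handling for passive voice, but this shows how far part-of-speech tags alone can take you.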
Google can also use subject, predicate, object statements for automated fact checking: an article could be considered credible if the statements in it match the ones in the databases Google is using. Wikipedia itself isn’t designed to provide structured data for fact checking, but Wikimedia maintains a database called Wikidata that can be used for this purpose. And Full Fact already uses Wikidata to support its automated fact checking service.
The experiment didn’t support the claim that adding more entities to an article will make it rank for more keyphrases, so stuffing an article with entities wouldn’t be a useful SEO strategy by itself. Attempts to improve the analysis with derived metrics like entity density and unique entity count didn’t help much either. In this analysis, the total number of entities was the most predictive metric for both keyphrases and organic traffic.