The Human’s Guide to NLP for SEO

What is NLP in SEO? 

NLP (natural language processing) is a subfield of artificial intelligence (AI), linguistics, and computer science concerned with how computers “read” and analyze natural human language.

It’s how algorithms used by search engines understand text on a website and know what to offer for any given search term.

Search engines (Google, YouTube, DuckDuckGo, and more) use their own variations of NLP to interpret search queries and match them with content that fits the user’s intent.

NLP impacts SEO strategy, writing, and overall content marketing. Understanding how it works will give you a leg up when trying to rank on Google search.

BERT (Bidirectional Encoder Representations from Transformers) is a language model that Google uses in its search algorithm. It’s trained on large data sets of text, and Google applies it to read and understand the language in search queries and on webpages, documents, and other content.

NLP is sort of like BERT’s brain. 🧠

Search engine optimization is based on working within the basics of NLP to create quality content for search engines.

From Google’s perspective, NLP is all about improving the user experience on the SERP (search engine results page). The goal is to give users the best answer the first time they enter a query.

The more people search for a target keyword and interact with the results, the more precisely Google delivers pages that meet the user’s search intent. It’s trickier to define the intent for long-tail keywords that aren’t searched often.

10 Steps in Google’s NLP

Google doesn’t just rely on a list of keywords to figure out what a page is about. It hasn’t operated that way for a decade or more. Instead, it uses NLP to determine the syntax of each sentence within the context of the page.

There are 10 basic parts to NLP as Google’s algorithm applies it:

  1. Sentence segmentation
  2. Tokenization
  3. Parts-of-speech tagging and chunking
  4. Morphology
  5. Word Dependency Trees
  6. Stemming and lemmatization
  7. Stop words removal
  8. Parse labeling
  9. Named entity extraction and recognition
  10. Subject categorization

Depending on who you ask, the process may look a bit different. For instance, NLP is often described more broadly in 5 phases, which apply to Google’s algorithm, too:

  1. Lexical (structural) analysis
  2. Parsing
  3. Semantic analysis
  4. Discourse integration
  5. Pragmatic analysis

I’ll use an example sentence to help explain some parts:

10% of adults are left-handed.

1. Sentence Segmentation

Sentence segmentation breaks a page down into distinct sentences.

This is the process of determining the full sentences within a larger body of text.

For instance, if I say, “A D.C. resident shares his favorite 3 restaurants.”, sentence segmentation helps define that as a full sentence (rather than stopping at every period).
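To make that concrete, here’s a minimal sketch of the idea in Python. It assumes a tiny, hand-picked abbreviation list (my own illustration, not how Google does it) so that the periods in “D.C.” don’t end the sentence.

```python
# Hypothetical abbreviation list; a real segmenter would need far more entries
# (or a trained model).
ABBREVIATIONS = {"d.c.", "u.s.", "dr.", "mr.", "mrs.", "e.g.", "i.e."}

def split_sentences(text):
    """Split text into sentences, ignoring periods that belong to known abbreviations."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        core = word.rstrip("\"')]")  # drop trailing quotes and brackets
        if core.endswith((".", "!", "?")) and core.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that lacks end punctuation
        sentences.append(" ".join(current))
    return sentences

text = "A D.C. resident shares his favorite 3 restaurants. Each one is downtown."
for sentence in split_sentences(text):
    print(sentence)
```

This toy version would miss a sentence that ends with an abbreviation, which is one reason production systems lean on trained models instead of simple rules.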

2. Tokenization

Tokenization breaks a sentence down into distinct parts.

For instance, each word, punctuation mark, number, and symbol is considered a separate “token.”

The % symbol, dash in “left-handed,” and period at the end of the sentence are separate tokens (in addition to words and numbers).

Here’s how our example sentence is tokenized: 10 | % | of | adults | are | left | - | handed | .
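As a rough sketch, a single regular expression can produce that kind of split. The pattern below is my own simplification (numbers, then words, then any single symbol), not Google’s actual tokenizer.

```python
import re

# Match a number (with optional decimals), a run of letters, or any single symbol.
TOKEN_PATTERN = re.compile(r"\d+(?:\.\d+)?|[A-Za-z]+|[^\sA-Za-z\d]")

def tokenize(sentence):
    """Return the sentence as a list of word, number, and symbol tokens."""
    return TOKEN_PATTERN.findall(sentence)

tokens = tokenize("10% of adults are left-handed.")
print(tokens)
# ['10', '%', 'of', 'adults', 'are', 'left', '-', 'handed', '.']
```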

3. Parts-of-Speech Tagging & Chunking

In this step, each token is given a defined part of speech or function. Then, multiple words are combined into “chunks” when identified as a single term (e.g., “United States” is a single term, not just two words).

The POS classes in NLP (not including subclasses) are:

  • Noun (N)
  • Verb (V)
  • Adjective (ADJ)
  • Adverb (ADV)
  • Preposition (P)
  • Conjunction (CON)
  • Pronoun (PRO)
  • Interjection (INT)

Our example sentence looks like this with parts-of-speech tagging:

Jocelyn D’Souza wrote a great article on how POS tagging and chunking work within NLP programming.
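For a feel of the two steps, here’s a toy sketch. The lexicon and the list of multi-word terms are hypothetical and tiny, and unknown words simply default to noun. Real taggers learn this from large annotated data sets.

```python
# Hypothetical mini-lexicon using the POS classes listed above.
LEXICON = {"residents": "N", "often": "ADV", "travel": "V", "abroad": "ADV"}

# Hypothetical list of two-word terms that should be treated as one chunk.
MULTIWORD_TERMS = {("united", "states")}

def tag(tokens):
    """Step 1: give every token a part of speech (unknown words default to noun)."""
    return [(token, LEXICON.get(token.lower(), "N")) for token in tokens]

def chunk(tagged):
    """Step 2: merge adjacent tokens that form a known two-word term."""
    chunks = []
    i = 0
    while i < len(tagged):
        pair = tuple(word.lower() for word, _ in tagged[i:i + 2])
        if pair in MULTIWORD_TERMS:
            merged = " ".join(word for word, _ in tagged[i:i + 2])
            chunks.append((merged, "N"))
            i += 2
        else:
            chunks.append(tagged[i])
            i += 1
    return chunks

tokens = ["United", "States", "residents", "often", "travel", "abroad"]
print(chunk(tag(tokens)))
```

“United” and “States” go in as two tokens and come out as one noun chunk.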

4. Morphology

Morphology takes the part-of-speech definition for each word and determines how a particular language modifies words to create different meanings.

Google’s morphological information includes how to “read” words in English, Chinese, Japanese, Russian, Korean, and gendered languages.

In our example sentence, morphology defines:

  • “Adults” as a PLURAL number
  • “Are” as an INDICATIVE verb in the PRESENT tense
  • “Left” as a PAST tense verb in PASSIVE voice
  • “Handed” as a PAST tense verb

Further steps in the NLP process will likely combine the separate tokens in “left-handed” so that the verb tense and voice are no longer included in how to define these terms.

5. Word Dependency Trees

Once the parts-of-speech and morphology of each word in a sentence are defined, Google’s algorithm creates a word dependency tree.

This tree displays the direction in which each word relates to the other words in a sentence.

Here’s the tree for our example sentence:
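If you like to think in code, a dependency tree is just a mapping from each word to the word it depends on. The heads below are hand-annotated by me for illustration (an assumption, not actual parser output), and the function counts how many hops each token sits from the root of the sentence.

```python
# Hand-annotated heads for illustration only. Each token maps to the token it
# depends on; the root of the sentence maps to None. Keying by token text only
# works here because no token repeats in this sentence.
HEADS = {
    "10": "%",
    "%": "are",
    "of": "%",
    "adults": "of",
    "are": None,
    "left": "handed",
    "-": "handed",
    "handed": "are",
    ".": "are",
}

def hops_to_root(token, heads):
    """Count how many dependency links separate a token from the root."""
    hops = 0
    while heads[token] is not None:
        token = heads[token]
        hops += 1
    return hops

for token in HEADS:
    print(f"{token!r}: {hops_to_root(token, HEADS)} hop(s) from the root")
```

In this hand-built tree, “adults” is the deepest token at three hops (adults → of → % → are).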

6. Stemming & Lemmatization

Stemming and lemmatization group and convert words with similar stems and meanings.

This includes things like:

  • Removing “ing” from verbs (dropping = drop)
  • Resolving tenses (am, are, is = be)
  • Consolidating word forms (cats, cat’s, cats’, cat = cat)

In our example, “adults” and “are” become lemmatized to “adult” and “be”:
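Here’s a deliberately crude sketch of those rules. The irregular-verb table and the suffix rules are my own simplifications and will mishandle plenty of real words. Real lemmatizers rely on full dictionaries and part-of-speech information.

```python
# Hypothetical lookup of irregular forms.
IRREGULAR_LEMMAS = {"am": "be", "are": "be", "is": "be", "was": "be", "were": "be"}

def lemmatize(word):
    """Reduce a word to a rough base form using a lookup and a few suffix rules."""
    word = word.lower().replace("\u2019", "'")  # normalize curly apostrophes
    if word in IRREGULAR_LEMMAS:
        return IRREGULAR_LEMMAS[word]
    # Possessives: cat's -> cat, cats' -> cats
    if word.endswith("'s"):
        word = word[:-2]
    elif word.endswith("s'"):
        word = word[:-1]
    # "ing" verbs, undoing a doubled consonant: dropping -> dropp -> drop
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        return stem
    # Simple plurals: adults -> adult
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

for word in ["adults", "are", "dropping", "cats", "cat's", "cats'"]:
    print(word, "->", lemmatize(word))
```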

7. Stop Words Removal

In the stop words removal step, the algorithm ignores words that add no information to the sentence.

Depending on the algorithm, different words may be ignored.

This list on GitHub gives a great example of what stop words may be removed by a given algorithm.
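In code, stop words removal is usually just a filter. The stop list below is a small, hypothetical sample. As noted above, every algorithm uses its own.

```python
# Hypothetical stop list; real lists run to hundreds of words.
STOP_WORDS = {"a", "an", "the", "of", "are", "is", "to", "in", "and"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not on the stop list."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]

tokens = ["10", "%", "of", "adults", "are", "left", "-", "handed", "."]
print(remove_stop_words(tokens))
# ['10', '%', 'adults', 'left', '-', 'handed', '.']
```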

8. Parse Labeling

Parse labeling defines how words and phrases relate to one another, not just the direction of the relationship.

Check out page 11 of the Stanford paper that defines these labels in more detail to see the full list.

Here’s what our example sentence looks like with parse labeling:

9. Named Entity Extraction & Recognition

Named entity extraction highlights “known” entities (anything the algorithm recognizes as a distinct thing). Then, named entity recognition determines what that thing is, based on the overall content on the page. 

In our sample sentence, “adults” and “left-handed” are entities that may be known. Entities may include a brand name, a person’s name, a place, a title, any pop culture term… the list goes on. 

Google constantly learns new entities as the AI gets smarter via machine learning.

For each entity, Google’s algorithm also defines:

  • Type: What type of entity this is (unknown, person, location, organization, event, work of art, product, other, phone number, address, date, number, or price)
  • Metadata: Usually a Wikipedia URL and/or a Knowledge Graph ID, if available
  • Mentions: The number of times a common noun, proper name, or unknown entity type is mentioned on the page
  • Salience: Rates the strength with which an entity is related to the overall page’s text (calculated from 0-1.0). For instance, an article about brushing your teeth might give a high salience score to terms like flossing, toothpaste, or gum tissue, but low scores to entities such as Nebraska, YouTube, and therapy dog.
  • Sentiment: Defines the positive vs. negative overall sentiment of an entity on the page (calculated from -1.0 to 1.0). Some Google research suggests that deeper levels of sentiment may be assigned, particularly to user-generated content (UGC), based on tags like “question,” “answer,” “humor,” etc.
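One way to picture those fields is as a small record per entity. The class and the numbers below are purely illustrative (I made up the scores to mirror the teeth-brushing example). This isn’t Google’s data model or real API output.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """Hypothetical container for the per-entity fields described above."""
    name: str
    entity_type: str
    salience: float              # 0 to 1.0
    sentiment: float = 0.0       # -1.0 to 1.0
    mentions: int = 1
    metadata: dict = field(default_factory=dict)

    def __post_init__(self):
        # Enforce the score ranges described in the list above.
        if not 0.0 <= self.salience <= 1.0:
            raise ValueError("salience must be between 0 and 1.0")
        if not -1.0 <= self.sentiment <= 1.0:
            raise ValueError("sentiment must be between -1.0 and 1.0")

# Made-up scores for an article about brushing your teeth.
entities = [
    Entity("Nebraska", "location", salience=0.02),
    Entity("toothpaste", "other", salience=0.41, mentions=5),
    Entity("flossing", "other", salience=0.33, mentions=3),
]

ranked = sorted(entities, key=lambda e: e.salience, reverse=True)
print([e.name for e in ranked])
```

Sorting by salience puts the entities most central to the page first.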

Let’s try another example sentence to show more about how entities relate:

“Leonardo Da Vinci painted the Mona Lisa in 1503.”

Here are the entity recognition and salience scores for our new sentence:

And here’s the sentiment analysis:

10. Subject Categorization

Google uses subject categories to understand where a page or piece of content belongs within the larger internet user experience. This is scored with a “confidence rating” of 0-1.0.

The more specific the category, the better.

For example, a page with a music video may be classified as /Arts & Entertainment/Music & Audio/Music Videos, which is preferred to the broader /Arts & Entertainment category.
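To show what “more specific” means in practice, here’s a small sketch that prefers the deepest category path among the confident candidates. The confidence values and the 0.5 threshold are assumptions for illustration only.

```python
def category_depth(path):
    """Count the levels in a category path like '/Arts & Entertainment/Music & Audio'."""
    return len([part for part in path.split("/") if part])

def most_specific(categories, min_confidence=0.5):
    """Pick the deepest category whose confidence clears a hypothetical threshold."""
    confident = [c for c in categories if c[1] >= min_confidence]
    if not confident:
        return None
    # Prefer depth first, then confidence as a tie-breaker.
    return max(confident, key=lambda c: (category_depth(c[0]), c[1]))

# Made-up (category, confidence) pairs.
candidates = [
    ("/Arts & Entertainment", 0.90),
    ("/Arts & Entertainment/Music & Audio/Music Videos", 0.80),
]
print(most_specific(candidates))
```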

One simple sentence is difficult to classify. However, by adding more information to create a new example paragraph, we can see Google’s API classify the category of our content.

“Leonardo Da Vinci painted the Mona Lisa in 1503. He was a famous artist and inventor who invented the first flying machine.”

You can try Google’s Natural Language API demo out for yourself to see how it breaks down a sentence or two in your writing.

How to Write Better with NLP

Google invests so much in its algorithm because it makes for a better user experience. Gone are the days of keyword stuffing, black-hat backlink building, and other “tricks” to get your website to appear at the top of search results.

The BERT update in 2019 is the latest in Google’s advancements towards a better search experience.

Today, if you want to succeed in winning long-term organic traffic for your website, you need to start with high-quality content.

Here’s how understanding NLP can help:

Write for the end user, not just an algorithm. (Seriously!)

Writing well for an algorithm sounds pretty lame. And in reality, your end user is the person who will most benefit from high-quality content.

Start every conversation, from strategy to writing and editing to uploading, with how your content will create value for the person who lands on your page.

(Every other item on this list will relate back to this first tip in one way or another.)

Think about:

  • Who your user is (including demographics, career, cultural background, physical location, habits, preferences, etc.)
  • Where your user is when they Google this keyword (Are they cooking? At the gym? In line at the DMV? On the toilet? At work?)
  • What emotional state they’re in (Confused? Scared? Bored? Curious? Neutral?)
  • The specific use cases where they can apply the information from your article in their own context (Are they looking for a product to buy? Information to share with friends? Scientific breakthroughs for a medical concern?)

The more you know about who you’re talking to and how they can use the information you provide, the better your SEO content will be.

Choose the best SEO tools.

The right tools can help you immensely when creating briefs and writing content.

For instance, Ahrefs is great at assisting your keyword research by showing how different keywords are related to one another. Clearscope is my go-to for on-page optimization, with data-driven terms to use within your writing.

When you use these tools, you’re better able to understand what topics, related keywords, and people also ask (PAA) questions your page should include.

SEO experts often have their favorites, and many of these tools share features and methodology. In other words, there’s no “perfect” tech stack for SEO — it’s just important to find out what works for you.

Optimize for featured snippets.

Featured snippets are small pieces of information that Google’s algorithm has determined are particularly relevant to a specific search query. They appear at the top of the SERP and can come in multiple forms, including:

  • One or more sentences in paragraph form
  • Images
  • Bulleted/numbered lists
  • Video snippets
  • Tables

To claim the top spot for a featured snippet, mirror the format of the existing snippet in your article (without plagiarizing or copying it directly).

For instance, if the featured snippet is a video, create your own video. If it’s a bulleted list of specific terms or definitions, use a bulleted list.

To make it super meta, here’s a featured snippet about featured snippets:

Answer relevant PAA (People also ask) questions.

The People also ask (PAA) section of the SERP is rich with questions users ask that may be related to a larger search query.

To best answer question-format queries, use the words in the query in your answer. (Yes, your English teacher might be a little appalled — but it works!)

I wrote a little more about this on LinkedIn:

Bonus: Optimizing for PAAs will increase your chances of being featured in voice search responses. Most voice searches are simple question/answer queries.

Don’t wait to answer the searcher’s intent.

Whenever you can, start sections with the answer to a question — don’t wait.

If you’re writing an article targeting the term “is coffee keto-friendly,” answer the question in the first paragraph on the page. 

We’re not writing books here. Don’t make your reader digest the entire page to get the simple answer to their question.

Write shorter sentences.

Short sentences are easiest for NLP algorithms to read. When writing for SEO, focus on just one idea per sentence, and state it as simply as you can.

This will reduce the complexity of sentence structuring with NLP.

For instance, if you’re trying to explain the winter temperatures in Alaska, look at the ways these two sentences break down:

“Winter temperatures in Alaska range from 0°F / -18°C to -30°F / -35°C from November to March.”
“Alaska's winter temperatures range from 0°F to -30°F.”

While both sentences are well-written and factually correct, the second has fewer dependency hops and entities to identify.
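If you want a quick self-check while editing, word count is a rough (and admittedly imperfect) proxy for sentence complexity. The 12-word limit below is a hypothetical threshold of my own, not a Google rule, and this sketch doesn’t measure dependency hops or entities directly.

```python
def word_count(sentence):
    """Count whitespace-separated words in a sentence."""
    return len(sentence.split())

def too_long(sentences, max_words=12):
    """Return the sentences that exceed a hypothetical word limit."""
    return [s for s in sentences if word_count(s) > max_words]

long_version = "Winter temperatures in Alaska range from 0°F / -18°C to -30°F / -35°C from November to March."
short_version = "Alaska's winter temperatures range from 0°F to -30°F."

for sentence in (long_version, short_version):
    print(word_count(sentence), "words:", sentence)

print("Consider splitting:", too_long([long_version, short_version]))
```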

Use bulleted and numbered lists.

Lists make for scannable content. Since 63% or more of organic search visits are on mobile devices, scannability is a major factor likely to influence dwell time.

Here are my personal rules for using lists to make better content:

  • Use numbers for ordered lists and bullets all other times. If a list of items comes in a particular order (such as recipe instructions), number the list. If the list doesn’t need to occur in a particular order (like symptoms of a disease), use bullets.
  • If there are more than three items, use a list. Instead of a sentence with a first, second, third, and fourth item, why not make it a list?
  • When list items need details, use bolded text for the main item, then unbold the description. I’m doing this right now! It helps to keep your detailed list items from becoming massive, overwhelming blocks of text.

Use schema markup correctly.

For pages with common information, Google’s NLP can read schema markup (structured data) to display rich information on the SERP for a search result.

Properly applied schema markup to define structured data will improve your user’s experience and win you points with Google’s algorithm.

Common types of structured data include:

  • E-commerce products (price, reviews, images, etc.)
  • Book details (author, summary, reviews, publisher, etc.)
  • Recipes (cuisine, ingredients, instructions, calories, etc.)
  • FAQ (question/answer)
  • Local business (open hours, address, phone number, etc.)

That’s definitely not an exhaustive list. Check out SEMRush’s page on structured data for a longer breakdown of how to do it the right way.
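As one example, FAQ structured data is typically written as JSON-LD using the schema.org FAQPage type. Here’s a minimal sketch that builds that markup in Python. The helper name and the placeholder answer text are my own, so swap in your real questions and answers.

```python
import json

def faq_schema(questions_and_answers):
    """Build a schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in questions_and_answers
        ],
    }

# Placeholder content for illustration only.
markup = faq_schema([("Is coffee keto-friendly?", "Placeholder answer text goes here.")])

# JSON-LD is embedded in the page inside a script tag of this type.
print('<script type="application/ld+json">')
print(json.dumps(markup, indent=2))
print("</script>")
```

Whatever you generate, run it through a structured data testing tool before publishing.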

Plan a content strategy around topical authority.

From a broader perspective, the content on your website should serve specific areas of topical authority. As Google’s NLP “reads” your content, it will give preference to your pages that are supported by other pages on similar topics.

As you publish content, take advantage of internal linking to connect pages that relate to one another. Every time a page is crawled, Google follows the internal links to establish these connections and understand more about your website’s authority.
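A simple way to audit this is to look for “orphan” pages that nothing else on your site links to. The sketch below assumes you already have a map of each page’s internal links (the URLs here are hypothetical). A crawler or your CMS export would supply the real data.

```python
# Hypothetical site map: each page lists the internal pages it links to.
INTERNAL_LINKS = {
    "/seo-guide": ["/nlp-for-seo", "/meta-descriptions"],
    "/nlp-for-seo": ["/seo-guide"],
    "/meta-descriptions": ["/seo-guide"],
    "/schema-markup": [],
}

def orphan_pages(links):
    """Return pages that no other page links to."""
    linked_to = {
        target
        for source, targets in links.items()
        for target in targets
        if target != source  # a page linking to itself doesn't count
    }
    return sorted(page for page in links if page not in linked_to)

print(orphan_pages(INTERNAL_LINKS))
# ['/schema-markup']
```

Any page that shows up here is a candidate for a new internal link from a related article.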

Learning content-led SEO? Join me!

I’m Rebekah, and I’d love to help you learn how to do content-led SEO for humans (and algorithms… sometimes 😉).

Follow me on LinkedIn and Twitter for fun tips (and a bit of sass) on SEO. See you there!
