As we’ve been learning more about SEO, crawlers have become an interest of mine. As an overview, web crawlers download and index content from all over the Internet. The goal is to learn what the pages on the web are about. To understand how these bots work, I built a rudimentary crawler in Java.
Accessing the HTML
To start the indexing process, I needed access to the website’s HTML source code. I did this by inputting a starter URL, which acted like a doorway to the rest of the site. Using Java’s java.net.URL class, my code established a connection to the web server hosting the doorway URL. From there I used a series of scanners to pull the full HTML source text into a .txt file. This raw HTML is the foundation of the website and is where I drew all of my information from.
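Here is a minimal sketch of how this step could look in Java. It is not the original program: the class name PageFetcher, the output file page.txt, the default URL https://example.com/, and the UTF-8 encoding are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Scanner;

// Hypothetical class name; a sketch of the "download the HTML" step.
public class PageFetcher {

    // Reads an entire stream into one String with a Scanner.
    // The "\\A" delimiter only matches the start of input, so next()
    // returns everything that follows as a single token.
    static String readAll(InputStream in) {
        try (Scanner scanner = new Scanner(in, "UTF-8")) { // assumes UTF-8 pages
            scanner.useDelimiter("\\A");
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    // Connects to the server behind the starter URL and returns the raw HTML.
    static String fetchHtml(String address) throws IOException {
        URL url = new URL(address);
        try (InputStream in = url.openStream()) {
            return readAll(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // The starter ("doorway") URL; example.com is only a placeholder.
        String starterUrl = args.length > 0 ? args[0] : "https://example.com/";
        String html = fetchHtml(starterUrl);

        // Save the raw HTML to a text file (file name is an assumption).
        Path out = Paths.get("page.txt");
        Files.write(out, html.getBytes(StandardCharsets.UTF_8));
        System.out.println("Saved " + html.length() + " characters to " + out);
    }
}
```

A real crawler would also want timeouts, error handling for pages that fail to load, and a check of the site’s robots.txt before fetching anything.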
Parsing the HTML
With the HTML in hand, my program scanned through it looking for web addresses, which are designated by href attributes. Using the java.util.regex package, I sifted through the HTML looking for those hrefs.
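The href search could be sketched roughly like this. The class name LinkExtractor and the exact pattern are assumptions, not the original code, and a regular expression is only an approximation of real HTML parsing: this one only handles quoted values and will also pick up href attributes on non-anchor elements such as <link>.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; a sketch of the "find the hrefs" step.
public class LinkExtractor {

    // Matches href="..." or href='...' in any letter case and captures
    // the quoted value in group 1.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns every href value found in the HTML, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"https://example.com/about\">About</a>"
                + "<p>No link here</p>"
                + "<A HREF='/contact'>Contact</A>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

Relative addresses such as /contact would still need to be resolved against the page’s own URL before they can be fetched.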
Storing Web Addresses
If the scanner encountered a URL within an anchor (<a>) tag or another element, it copied that URL to a separate .txt file. This repository served as a set of new doorways. After the initial page was processed, I would move on to the links in the .txt file, repeating the same process.
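The repeat-for-every-stored-link loop can be sketched as follows. Note the swap: instead of a .txt file, this sketch keeps the pending doorways in an in-memory queue and adds a set of already-seen URLs so the same page is not processed twice. The class name CrawlFrontier, the maxPages limit, and the linksOf function (standing in for "download the page and extract its links") are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical class name; a sketch of the "store links, then revisit them" step.
public class CrawlFrontier {

    // Visits pages breadth-first starting from starterUrl, stopping after
    // maxPages pages. linksOf returns the links found on a given page.
    static List<String> crawl(String starterUrl, int maxPages,
                              Function<String, List<String>> linksOf) {
        Deque<String> toVisit = new ArrayDeque<>(); // doorways waiting to be opened
        Set<String> seen = new HashSet<>();         // every URL ever queued
        List<String> visited = new ArrayList<>();   // pages processed, in order

        toVisit.add(starterUrl);
        seen.add(starterUrl);

        while (!toVisit.isEmpty() && visited.size() < maxPages) {
            String current = toVisit.poll();
            visited.add(current);
            for (String link : linksOf.apply(current)) {
                // Set.add returns false for duplicates, so each URL is queued once.
                if (seen.add(link)) {
                    toVisit.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // A tiny made-up site, used in place of real network requests.
        Map<String, List<String>> site = new HashMap<>();
        site.put("/home", Arrays.asList("/about", "/blog"));
        site.put("/about", Arrays.asList("/home"));
        site.put("/blog", Arrays.asList("/about", "/post-1"));

        List<String> order = crawl("/home", 10,
                url -> site.getOrDefault(url, Collections.emptyList()));
        System.out.println(order);
    }
}
```

Without the seen set, two pages that link to each other would keep the crawler going in circles, which is why even a rudimentary crawler benefits from tracking where it has already been.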
Unlocking the Secrets of the Web
The ability to read through a website’s HTML and store web addresses opens doors to a world of possibilities. Through code much more complicated than my own, search engines index a huge portion of the web. It was super interesting to build this kind of crawler and see some of the elements that make SEO possible.
2 Responses
Developers for sure can get a look into what the Googlebot is most likely seeing by using code. Uncovering patterns during this process will let you know what is and isn’t working on that website. These skills will help you better understand SEO.
This is all a really good way to analyze any webpage. Even just using the Chromium browser’s “Inspect” button can give you information you wouldn’t have had otherwise.