{"id":1855,"date":"2024-03-26T04:05:27","date_gmt":"2024-03-26T04:05:27","guid":{"rendered":"https:\/\/gccwebsites.com\/seoblog\/?p=1855"},"modified":"2024-05-15T04:49:47","modified_gmt":"2024-05-15T04:49:47","slug":"using-code-to-read-through-websites","status":"publish","type":"post","link":"https:\/\/gccwebsites.com\/seoblog\/2024\/03\/26\/using-code-to-read-through-websites\/","title":{"rendered":"Using Code to Read Through Websites"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"843\" height=\"341\" src=\"https:\/\/gccwebsites.com\/seoblog\/wp-content\/uploads\/2024\/03\/image-13.png\" alt=\"\" class=\"wp-image-1856\" srcset=\"https:\/\/gccwebsites.com\/seoblog\/wp-content\/uploads\/2024\/03\/image-13.png 843w, https:\/\/gccwebsites.com\/seoblog\/wp-content\/uploads\/2024\/03\/image-13-300x121.png 300w, https:\/\/gccwebsites.com\/seoblog\/wp-content\/uploads\/2024\/03\/image-13-768x311.png 768w\" sizes=\"auto, (max-width: 843px) 100vw, 843px\" \/><\/figure>\n\n\n\n<p>As we&#8217;ve been learning more about SEO, crawlers have become an interest of mine. As an overview, web crawlers&nbsp;download and index content from all over the Internet. The goal is to learn what webpages on the web are about. To understand how these bots work I built a rudimentary crawler in Java.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Accessing the HTML<\/strong><\/h2>\n\n\n\n<p>To start the indexing process, I needed access to the website&#8217;s HTML source code. I did this by supplying a starter URL, which acted like a doorway to the rest of the site. Using classes like Java&#8217;s <a target=\"_blank\" rel=\"noreferrer noopener\" href=\"http:\/\/java.net\">java.net<\/a>.URL, my code established a connection to the web server hosting the doorway URL. From there, I used a series of scanners to write the full HTML source text to a .txt file. 
This raw HTML is the foundation of the website and is where I drew all of my information from.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Parsing the HTML<\/strong><\/h2>\n\n\n\n<p>With the HTML in hand, my program scanned it for web addresses designated by href attributes. Using the java.util.regex package, I sifted through the HTML looking for those hrefs.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Storing Web Addresses<\/strong><\/h2>\n\n\n\n<p>If the scanner encountered a URL within anchor (&lt;a&gt;) tags or other elements, it copied that address to a separate .txt file. This repository served as a set of new doorways. After the initial page was indexed, I moved on to the links in the .txt file, repeating the same process.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Unlocking the Secrets of the Web<\/strong><\/h2>\n\n\n\n<p>The ability to read through a website&#8217;s HTML and store web addresses opens doors to a world of possibilities. Through code much more complicated than my own, search engines can index the whole Internet. It was super interesting being able to build this kind of crawler and see some of the elements that make SEO possible.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>As we&#8217;ve been learning more about SEO, crawlers have become an interest of mine. As an overview, web crawlers&nbsp;download and index content from all over the Internet. The goal is to learn what webpages on the web are about. To understand how these bots work I built a rudimentary crawler in Java. 
Accessing the HTML [&hellip;]<\/p>\n","protected":false},"author":47,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-1855","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.3 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Using Code to Read Through Websites - ENTR 330 SEO Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/gccwebsites.com\/seoblog\/2024\/03\/26\/using-code-to-read-through-websites\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Using Code to Read Through Websites - ENTR 330 SEO Blog\" \/>\n<meta property=\"og:description\" content=\"As we&#8217;ve been learning more about SEO, crawlers have become an interest of mine. As an overview, web crawlers&nbsp;download and index content from all over the Internet. The goal is to learn what webpages on the web are about. To understand how these bots work I built a rudimentary crawler in Java. 
Accessing the HTML [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/gccwebsites.com\/seoblog\/2024\/03\/26\/using-code-to-read-through-websites\/\" \/>\n<meta property=\"og:site_name\" content=\"ENTR 330 SEO Blog\" \/>\n<meta property=\"article:published_time\" content=\"2024-03-26T04:05:27+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-05-15T04:49:47+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/gccwebsites.com\/seoblog\/wp-content\/uploads\/2024\/03\/image-13.png\" \/>\n<meta name=\"author\" content=\"VahlbergCD22\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"VahlbergCD22\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"2 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/gccwebsites.com\\\/seoblog\\\/2024\\\/03\\\/26\\\/using-code-to-read-through-websites\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/gccwebsites.com\\\/seoblog\\\/2024\\\/03\\\/26\\\/using-code-to-read-through-websites\\\/\"},\"author\":{\"name\":\"VahlbergCD22\",\"@id\":\"https:\\\/\\\/gccwebsites.com\\\/seoblog\\\/#\\\/schema\\\/person\\\/f078336d62f192088d7ce45c43801077\"},\"headline\":\"Using Code to Read Through 
Websites\",\"datePublished\":\"2024-03-26T04:05:27+00:00\",\"dateModified\":\"2024-05-15T04:49:47+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/gccwebsites.com\\\/seoblog\\\/2024\\\/03\\\/26\\\/using-code-to-read-through-websites\\\/\"},\"wordCount\":325,\"commentCount\":2,\"image\":{\"@id\":\"https:\\\/\\\/gccwebsites.com\\\/seoblog\\\/2024\\\/03\\\/26\\\/using-code-to-read-through-websites\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/gccwebsites.com\\\/seoblog\\\/wp-content\\\/uploads\\\/2024\\\/03\\\/image-13.png\",\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/gccwebsites.com\\\/seoblog\\\/2024\\\/03\\\/26\\\/using-code-to-read-through-websites\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/gccwebsites.com\\\/seoblog\\\/2024\\\/03\\\/26\\\/using-code-to-read-through-websites\\\/\",\"url\":\"https:\\\/\\\/gccwebsites.com\\\/seoblog\\\/2024\\\/03\\\/26\\\/using-code-to-read-through-websites\\\/\",\"name\":\"Using Code to Read Through Websites - ENTR 330 SEO 
Blog\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/gccwebsites.com\\\/seoblog\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/gccwebsites.com\\\/seoblog\\\/2024\\\/03\\\/26\\\/using-code-to-read-through-websites\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/gccwebsites.com\\\/seoblog\\\/2024\\\/03\\\/26\\\/using-code-to-read-through-websites\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/gccwebsites.com\\\/seoblog\\\/wp-content\\\/uploads\\\/2024\\\/03\\\/image-13.png\",\"datePublished\":\"2024-03-26T04:05:27+00:00\",\"dateModified\":\"2024-05-15T04:49:47+00:00\",\"author\":{\"@id\":\"https:\\\/\\\/gccwebsites.com\\\/seoblog\\\/#\\\/schema\\\/person\\\/f078336d62f192088d7ce45c43801077\"},\"breadcrumb\":{\"@id\":\"https:\\\/\\\/gccwebsites.com\\\/seoblog\\\/2024\\\/03\\\/26\\\/using-code-to-read-through-websites\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/gccwebsites.com\\\/seoblog\\\/2024\\\/03\\\/26\\\/using-code-to-read-through-websites\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/gccwebsites.com\\\/seoblog\\\/2024\\\/03\\\/26\\\/using-code-to-read-through-websites\\\/#primaryimage\",\"url\":\"https:\\\/\\\/gccwebsites.com\\\/seoblog\\\/wp-content\\\/uploads\\\/2024\\\/03\\\/image-13.png\",\"contentUrl\":\"https:\\\/\\\/gccwebsites.com\\\/seoblog\\\/wp-content\\\/uploads\\\/2024\\\/03\\\/image-13.png\",\"width\":843,\"height\":341},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/gccwebsites.com\\\/seoblog\\\/2024\\\/03\\\/26\\\/using-code-to-read-through-websites\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/gccwebsites.com\\\/seoblog\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Using Code to Read Through 
Websites\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/gccwebsites.com\\\/seoblog\\\/#website\",\"url\":\"https:\\\/\\\/gccwebsites.com\\\/seoblog\\\/\",\"name\":\"ENTR 330 SEO Blog\",\"description\":\"Course Blog of ENTR 330\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/gccwebsites.com\\\/seoblog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/gccwebsites.com\\\/seoblog\\\/#\\\/schema\\\/person\\\/f078336d62f192088d7ce45c43801077\",\"name\":\"VahlbergCD22\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/1ebb5d3710634d82e49b96f5b98d279ba31611af16eed50f9705087540c12f1c?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/1ebb5d3710634d82e49b96f5b98d279ba31611af16eed50f9705087540c12f1c?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/1ebb5d3710634d82e49b96f5b98d279ba31611af16eed50f9705087540c12f1c?s=96&d=mm&r=g\",\"caption\":\"VahlbergCD22\"},\"url\":\"https:\\\/\\\/gccwebsites.com\\\/seoblog\\\/author\\\/vahlbergcd22\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Using Code to Read Through Websites - ENTR 330 SEO Blog","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/gccwebsites.com\/seoblog\/2024\/03\/26\/using-code-to-read-through-websites\/","og_locale":"en_US","og_type":"article","og_title":"Using Code to Read Through Websites - ENTR 330 SEO Blog","og_description":"As we&#8217;ve been learning more about SEO, crawlers have become an interest of mine. 
As an overview, web crawlers&nbsp;download and index content from all over the Internet. The goal is to learn what webpages on the web are about. To understand how these bots work I built a rudimentary crawler in Java. Accessing the HTML [&hellip;]","og_url":"https:\/\/gccwebsites.com\/seoblog\/2024\/03\/26\/using-code-to-read-through-websites\/","og_site_name":"ENTR 330 SEO Blog","article_published_time":"2024-03-26T04:05:27+00:00","article_modified_time":"2024-05-15T04:49:47+00:00","og_image":[{"url":"https:\/\/gccwebsites.com\/seoblog\/wp-content\/uploads\/2024\/03\/image-13.png","type":"","width":"","height":""}],"author":"VahlbergCD22","twitter_card":"summary_large_image","twitter_misc":{"Written by":"VahlbergCD22","Est. reading time":"2 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/gccwebsites.com\/seoblog\/2024\/03\/26\/using-code-to-read-through-websites\/#article","isPartOf":{"@id":"https:\/\/gccwebsites.com\/seoblog\/2024\/03\/26\/using-code-to-read-through-websites\/"},"author":{"name":"VahlbergCD22","@id":"https:\/\/gccwebsites.com\/seoblog\/#\/schema\/person\/f078336d62f192088d7ce45c43801077"},"headline":"Using Code to Read Through 
Websites","datePublished":"2024-03-26T04:05:27+00:00","dateModified":"2024-05-15T04:49:47+00:00","mainEntityOfPage":{"@id":"https:\/\/gccwebsites.com\/seoblog\/2024\/03\/26\/using-code-to-read-through-websites\/"},"wordCount":325,"commentCount":2,"image":{"@id":"https:\/\/gccwebsites.com\/seoblog\/2024\/03\/26\/using-code-to-read-through-websites\/#primaryimage"},"thumbnailUrl":"https:\/\/gccwebsites.com\/seoblog\/wp-content\/uploads\/2024\/03\/image-13.png","inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/gccwebsites.com\/seoblog\/2024\/03\/26\/using-code-to-read-through-websites\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/gccwebsites.com\/seoblog\/2024\/03\/26\/using-code-to-read-through-websites\/","url":"https:\/\/gccwebsites.com\/seoblog\/2024\/03\/26\/using-code-to-read-through-websites\/","name":"Using Code to Read Through Websites - ENTR 330 SEO Blog","isPartOf":{"@id":"https:\/\/gccwebsites.com\/seoblog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/gccwebsites.com\/seoblog\/2024\/03\/26\/using-code-to-read-through-websites\/#primaryimage"},"image":{"@id":"https:\/\/gccwebsites.com\/seoblog\/2024\/03\/26\/using-code-to-read-through-websites\/#primaryimage"},"thumbnailUrl":"https:\/\/gccwebsites.com\/seoblog\/wp-content\/uploads\/2024\/03\/image-13.png","datePublished":"2024-03-26T04:05:27+00:00","dateModified":"2024-05-15T04:49:47+00:00","author":{"@id":"https:\/\/gccwebsites.com\/seoblog\/#\/schema\/person\/f078336d62f192088d7ce45c43801077"},"breadcrumb":{"@id":"https:\/\/gccwebsites.com\/seoblog\/2024\/03\/26\/using-code-to-read-through-websites\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/gccwebsites.com\/seoblog\/2024\/03\/26\/using-code-to-read-through-websites\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/gccwebsites.com\/seoblog\/2024\/03\/26\/using-code-to-read-through-websites\/#primaryimage","url":"https:\/\/g
ccwebsites.com\/seoblog\/wp-content\/uploads\/2024\/03\/image-13.png","contentUrl":"https:\/\/gccwebsites.com\/seoblog\/wp-content\/uploads\/2024\/03\/image-13.png","width":843,"height":341},{"@type":"BreadcrumbList","@id":"https:\/\/gccwebsites.com\/seoblog\/2024\/03\/26\/using-code-to-read-through-websites\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/gccwebsites.com\/seoblog\/"},{"@type":"ListItem","position":2,"name":"Using Code to Read Through Websites"}]},{"@type":"WebSite","@id":"https:\/\/gccwebsites.com\/seoblog\/#website","url":"https:\/\/gccwebsites.com\/seoblog\/","name":"ENTR 330 SEO Blog","description":"Course Blog of ENTR 330","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/gccwebsites.com\/seoblog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/gccwebsites.com\/seoblog\/#\/schema\/person\/f078336d62f192088d7ce45c43801077","name":"VahlbergCD22","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/1ebb5d3710634d82e49b96f5b98d279ba31611af16eed50f9705087540c12f1c?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/1ebb5d3710634d82e49b96f5b98d279ba31611af16eed50f9705087540c12f1c?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/1ebb5d3710634d82e49b96f5b98d279ba31611af16eed50f9705087540c12f1c?s=96&d=mm&r=g","caption":"VahlbergCD22"},"url":"https:\/\/gccwebsites.com\/seoblog\/author\/vahlbergcd22\/"}]}},"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/gccwebsites.com\/seoblog\/wp-json\/wp\/v2\/posts\/1855","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/gccwebsites.com\/seoblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/gccwebsites.com\/seoblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embe
ddable":true,"href":"https:\/\/gccwebsites.com\/seoblog\/wp-json\/wp\/v2\/users\/47"}],"replies":[{"embeddable":true,"href":"https:\/\/gccwebsites.com\/seoblog\/wp-json\/wp\/v2\/comments?post=1855"}],"version-history":[{"count":2,"href":"https:\/\/gccwebsites.com\/seoblog\/wp-json\/wp\/v2\/posts\/1855\/revisions"}],"predecessor-version":[{"id":2570,"href":"https:\/\/gccwebsites.com\/seoblog\/wp-json\/wp\/v2\/posts\/1855\/revisions\/2570"}],"wp:attachment":[{"href":"https:\/\/gccwebsites.com\/seoblog\/wp-json\/wp\/v2\/media?parent=1855"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/gccwebsites.com\/seoblog\/wp-json\/wp\/v2\/categories?post=1855"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/gccwebsites.com\/seoblog\/wp-json\/wp\/v2\/tags?post=1855"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}