V2

Public Website Inventory

Site Scanning Engine

  • (a) Site Scanning source file - A large list of websites, which is harvested and combined with others to make the Site Scanning website index. Source files may be static or dynamic. More details about each source file can be found here.
  • (the) Site Scanning website index - The total list of websites which the Site Scanning engine scans every day to generate the Site Scanning results. The index is created by combining all of the source files and is located at https://github.com/GSA/federal-website-index/blob/main/data/site-scanning-target-url-list.csv. The Public Website Inventory is a subset of this index.
  • (a) Site Scanning scan - A set of analyses that take place by loading an initial URL and performing certain actions.
  • (the) Site Scanning results - The combined data that is generated by the scans that take place each day.
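
As a rough illustration of how source files might be combined into a single index, here is a minimal sketch. It assumes each source file is a CSV with a `target_url` column; that column name, the `build_index` helper, and the sample rows are hypothetical and are not taken from the Site Scanning codebase.

```python
import csv
import io


def build_index(source_files):
    """Combine several source lists into one de-duplicated, sorted index.

    `source_files` is an iterable of CSV texts. Each is assumed (for this
    sketch only) to have a `target_url` column.
    """
    seen = set()
    for text in source_files:
        for row in csv.DictReader(io.StringIO(text)):
            host = (row.get("target_url") or "").strip().lower()
            if host:
                seen.add(host)
    return sorted(seen)


# Two made-up source files that overlap on one entry.
source_a = "target_url\nwww.fbi.gov\nvault.fbi.gov\n"
source_b = "target_url\nWWW.FBI.GOV\nwww.ed.gov\n"

index = build_index([source_a, source_b])
print(index)  # ['vault.fbi.gov', 'www.ed.gov', 'www.fbi.gov']
```

Lower-casing before de-duplication is a design choice made here because hostnames are case-insensitive; the real pipeline may normalize differently.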

Site Scanning Publications

  • (a) Site Scanning snapshot - An export of the Site Scanning results, either in full, or trimmed to a certain subset for convenience.
  • (a) Site Scanning report - An analysis that takes the Site Scanning results and runs certain counts or calculations against them for a specific stakeholder.
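
The difference between a snapshot (an export, possibly trimmed) and a report (counts or calculations over the results) can be sketched as follows. The field names (`target_url`, `agency`, `status_code`), the helper names, and the sample rows are all invented for illustration.

```python
from collections import Counter


def make_snapshot(results, fields=None, keep=None):
    """Return a static copy of the results.

    `keep` optionally filters rows; `fields` optionally trims columns.
    With neither argument, this is a full export.
    """
    rows = [r for r in results if keep is None or keep(r)]
    if fields is None:
        return [dict(r) for r in rows]
    return [{f: r.get(f) for f in fields} for r in rows]


def make_report(results, field):
    """Count how many result rows share each value of `field`."""
    return dict(Counter(r.get(field) for r in results))


# Hypothetical scan results.
results = [
    {"target_url": "one.agency-a.gov", "agency": "Agency A", "status_code": 200},
    {"target_url": "two.agency-a.gov", "agency": "Agency A", "status_code": 404},
    {"target_url": "www.agency-b.gov", "agency": "Agency B", "status_code": 200},
]

live_only = make_snapshot(
    results,
    fields=["target_url", "status_code"],
    keep=lambda r: r["status_code"] == 200,
)
by_agency = make_report(results, "agency")
print(live_only)
print(by_agency)  # {'Agency A': 2, 'Agency B': 1}
```

A snapshot returns copies of the rows, so later changes to the results do not alter it; a report reduces the rows to summary numbers.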

General

  • (a) snapshot - A static copy of a dataset, generated to provide a point-in-time view of the data.

V1

| Term | Definition |
| --- | --- |
| .gov registry | The program, operated by GSA, that administers the .gov top-level domain. Found at www.dotgov.gov, this program is how federal, state, tribal, and local government agencies register .gov domains. |
| List of .gov domains | The official public list of registered .gov domains (link). This includes all registered domains from the .gov registry, spanning federal, state, and local government registrations. |
| List of federal .gov domains | The subset of the list of .gov domains that are registered by federal agencies (link). |
| Domain | A registered second-level domain, such as fbi.gov, state.gov, or commerce.gov. Domains are registered and serve as the foundation on which one or many websites are arranged. The Site Scanning program is focused exclusively on domains that have been registered by federal agencies, all of which together make up the list of federal .gov domains. |
| Subdomain | A distinct location set up within a domain, such as www.fbi.gov, vault.fbi.gov, or forms.fbi.gov. All subdomains are configured by the domain owner to point to individual websites or other web services. One domain may have hundreds of subdomains. The Site Scanning program attempts to identify all publicly accessible subdomains that exist on federal .gov domains, with the goal of quantifying the entire federal .gov web presence. |
| Website | A collection of webpages, usually accessible on a subdomain, that together make up a discrete web property. Often, a website is operated by a distinct web server. Several subdomains may share a similar look and feel, giving the visitor a sense of visiting just one website. Examples of this might be results.usaid.gov and www.usaid.gov, or blog.ed.gov and www.ed.gov. However, differing subdomains often indicate that different servers or systems power each subdomain, and thus each is considered a distinct website for the purposes of this program. For the purposes of the Site Scanning program, a website is a live subdomain that does not redirect, is publicly accessible, and is intended for consumption by a web browser. Examples of live subdomains that are not considered websites include: API endpoints, email servers, redirects, [what else?] |
| Site Scan | An automated script that targets a list of publicly accessible URLs, loads each one, and analyzes it in order to generate data about the web service that exists at that location. |
| List of Target URLs | The combined list of all known, live federal .gov subdomains. Each Target URL is individually scanned in order to generate the data published by the Site Scanning program. |
| Target URL | The URL of an individual subdomain that is scanned. |
| Final URL | The URL that resolves after a web browser loads the Target URL and follows any redirects that take place. If the Target URL does not redirect, its Final URL will be the same as the Target URL. If it does redirect, the Final URL will be different. The combined list of Final URLs represents the totality of the federal .gov web presence, as best as we know. |
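
To make the Target URL / Final URL and Domain / Subdomain distinctions concrete, here is a small sketch. It does not make network requests: redirects are modeled as a plain dictionary, and the redirect shown is hypothetical. The helper names `final_url` and `registered_domain` are illustrative and are not part of the Site Scanning engine. The domain helper assumes a simple .gov hostname where the registered domain is the last two labels.

```python
from urllib.parse import urlsplit


def final_url(target_url, redirects, max_hops=10):
    """Follow redirects from a Target URL until none remain.

    `redirects` maps a URL to the URL it redirects to. A URL that is not a
    key in the map does not redirect, so it is its own Final URL.
    """
    url = target_url
    seen = {url}
    hops = 0
    while url in redirects:
        if hops == max_hops:
            raise ValueError("too many redirects")
        url = redirects[url]
        hops += 1
        if url in seen:
            raise ValueError("redirect loop")
        seen.add(url)
    return url


def registered_domain(url):
    """Return the second-level domain, e.g. 'fbi.gov' for vault.fbi.gov.

    Simplification: takes the last two hostname labels, which is only
    appropriate for plain .gov hostnames.
    """
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


# Hypothetical redirect map, for illustration only.
redirects = {"https://fbi.gov/": "https://www.fbi.gov/"}

print(final_url("https://fbi.gov/", redirects))        # https://www.fbi.gov/
print(final_url("https://vault.fbi.gov/", redirects))  # https://vault.fbi.gov/
print(registered_domain("https://vault.fbi.gov/"))     # fbi.gov
```

Under the Website definition above, only the second Target URL in this example would count as a website on its own, since the first one redirects.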