We only want to extract some information about the URL, and we accept that this information won't be perfect: we'll need to make assumptions and use heuristics to decide where to get it from.
Because of this, and to avoid retrieving and parsing huge documents, we should ideally fetch and parse the remote resource incrementally, stopping as soon as we have enough information. For example, generated links always have a maximum length, so if we are asked to generate a link for a resource storing the complete works of Shakespeare, we only need to fetch the first 4K at most and then we are done. A lot of CPU time and network traffic can be saved this way.
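The incremental, capped read described above could be sketched as follows. This is an illustration only: the `read_capped` helper, the 4096-byte limit, and the chunk size are assumptions for the example, not part of any particular implementation. An in-memory stream stands in for the network response, but the same logic applies to any file-like object returned by an HTTP client.

```python
import io

MAX_PREVIEW_BYTES = 4096  # assumption: 4K is enough for the metadata we need


def read_capped(stream, limit=MAX_PREVIEW_BYTES, chunk_size=1024):
    """Read from `stream` in chunks, stopping once `limit` bytes are
    collected or the stream is exhausted, whichever comes first."""
    chunks = []
    remaining = limit
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:  # stream ended before we hit the cap
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


# A large document stands in for, say, the complete works of Shakespeare:
big_document = b"<html><head><title>Hamlet</title></head>" + b"x" * 100_000
head = read_capped(io.BytesIO(big_document))
# Only the first 4K is read; the rest of the body is never pulled in.
```

Pairing this with a streaming HTTP client (so the connection can be closed after the cap is reached) is what actually realizes the bandwidth savings; reading a fully buffered response body saves only parsing time.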