cbornet commented Jul 24, 2024

No description provided.

cbornet marked this pull request as draft July 24, 2024 22:23
## Working with external knowledge
- [Build a Retrieval Augmented Generation (RAG) Application](/docs/tutorials/rag)
- [Build a Conversational RAG Application](/docs/tutorials/qa_chat_history)
- [Build a Tech Support Bot from an existing Knowledge Base](/docs/tutorials/graph_vectorstore)

Possible titles to get RAG and links in:

Build a Tech Support RAG Application with Content Links (or Hyperlinks, or something along those lines)?

kerinin commented Jul 25, 2024

The example feels very DataStax-specific. I think we should make sure the way it's written doesn't assume the reader has ever heard of us or any of our products. For example, rather than "Load the Astra Documentation", use something like "Load the Documentation pages".

I also think we need to cut the length down a lot, because this spends too much time on environment setup. We could do several things to simplify (off the top of my head):

  • Hard-code URLs rather than using the sitemap
  • Rely on env vars rather than using getpass
  • Create links for all URLs regardless of prefix
  • Load documents in a single call rather than batching them

...basically look for anything that isn't explaining graph RAG directly and try to simplify it away.

@bjchambers

Not really.

  • Hard-code URLs rather than using the sitemap

There are around 4000 pages, so hard-coding the URLs would take more than using the sitemap does. It would be better if we could get the sitemap logic into LangChain. We could (perhaps) pickle the list and load that. Or we could break it out into a function so it doesn't show up in the notebook. But it is part of showing a "real" example, and I think it's actually useful to show how to use the sitemap to crawl your own knowledge base (it's very reusable).
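To make the "break it into a function" option concrete, here is a minimal sketch of what such a helper could look like. It assumes a standard sitemaps.org `sitemap.xml`; the function name `urls_from_sitemap` and the sample XML are hypothetical and are not taken from the notebook.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org 0.9 protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str, prefix: str = "") -> list[str]:
    """Return every <loc> URL in a sitemap, optionally filtered by prefix."""
    root = ET.fromstring(xml_text)
    urls = [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return [url for url in urls if url.startswith(prefix)]


# Hypothetical sample input standing in for a downloaded sitemap.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/a.html</loc></url>
  <url><loc>https://example.com/blog/b.html</loc></url>
</urlset>"""

docs_urls = urls_from_sitemap(SAMPLE_SITEMAP, prefix="https://example.com/docs/")
```

Keeping the helper in a separate module would leave only the one-line call in the notebook while still showing the reusable crawl step.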

  • Rely on env vars rather than using getpass

Sure. I don't think it will save much, though. And it's not typical for reusable notebooks (env vars are harder to set for someone who wants to run it).
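One possible middle ground, sketched here as an assumption rather than what the notebook does: read the environment variable when it is set and fall back to `getpass` otherwise. The helper name `get_secret` and the variable name in the usage comment are hypothetical.

```python
import os
from getpass import getpass


def get_secret(name: str) -> str:
    """Return the value of env var `name`, prompting only when it is unset."""
    value = os.environ.get(name)
    if not value:
        # Interactive fallback for readers who have not exported the variable.
        value = getpass(f"{name}: ")
    return value


# Usage (hypothetical variable name):
# token = get_secret("MY_SERVICE_TOKEN")
```

This keeps the setup cell to one line per secret while still working for someone who opens the notebook without any prior configuration.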

  • Create links for all URLs regardless of prefix

It already does this. If you're referring to not using the CSS selectors based on the prefix, that is important to keep the header/footer/navigation from becoming part of the content and flooding the links. This is important to show (although we could simplify with an HTML-to-Markdown-plus-HTML-extractors document transformer) since it is part of real examples.
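As a rough illustration of the prefix-based selection described above (not the notebook's actual code), the lookup could be as small as the sketch below. The prefixes, selectors, and the name `selector_for` are all hypothetical; real values depend on the site's markup.

```python
from typing import Optional

# Hypothetical table mapping URL prefixes to the CSS selector that isolates
# the main content, so header/footer/navigation links are left out.
CONTENT_SELECTORS = {
    "https://example.com/docs/": "article.doc",
    "https://example.com/docs/api/": "main .content",
}


def selector_for(url: str) -> Optional[str]:
    """Return the selector for the longest matching prefix, or None."""
    matches = [prefix for prefix in CONTENT_SELECTORS if url.startswith(prefix)]
    if not matches:
        return None
    return CONTENT_SELECTORS[max(matches, key=len)]
```

The returned selector would then be handed to whatever HTML parser extracts the page body, so that links are only collected from the selected element.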

  • Load documents in a single call rather than batching them

Too many documents to load in a single call.
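For reference, the batching itself can stay small. This is a minimal sketch assuming the documents are available as any iterable; the helper name `batched` and the batch size in the usage comment are illustrative and not taken from the notebook.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield successive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice pulls up to `size` items per pass; an empty list ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch


# Usage (assumes `docs` is the loaded documents and `store` is the vector store):
# for batch in batched(docs, 50):
#     store.add_documents(batch)
```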

3 participants