Replies: 3 comments 4 replies
-
I've noticed the same thing. It is taking more than 72 hours for a 1.6 MB text file. For reference, this is my config.yaml. The prompts are the default ones, with the entity prompt tweaked a little bit, but the token size remains the same:

```yaml
### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

models:
  default_chat_model:
    type: openai_chat # or azure_openai_chat
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-05-01-preview
    auth_type: api_key # or azure_managed_identity
    api_base: ${OPENAI_BASE_URL}
    api_key: ${GRAPHRAG_API_KEY} # set this in the generated .env file
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    model: ${GRAPHRAG_CHAT_MODEL}
    # deployment_name: <azure_model_deployment_name>
    # encoding_model: cl100k_base # automatically set by tiktoken if left undefined
    model_supports_json: false # recommended if this is available for your model.
    concurrent_requests: 25 # max number of simultaneous LLM requests allowed
    async_mode: threaded # or asyncio
    retry_strategy: native
    max_retries: 10
    tokens_per_minute: auto # set to null to disable rate limiting
    requests_per_minute: auto # set to null to disable rate limiting
  default_embedding_model:
    type: openai_embedding # or azure_openai_embedding
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-05-01-preview
    auth_type: api_key # or azure_managed_identity
    api_base: ${OPENAI_BASE_URL}
    api_key: ${GRAPHRAG_API_KEY}
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    model: ${GRAPHRAG_EMBEDDINGS_MODEL}
    # deployment_name: <azure_model_deployment_name>
    # encoding_model: cl100k_base # automatically set by tiktoken if left undefined
    model_supports_json: false # recommended if this is available for your model.
    concurrent_requests: 25 # max number of simultaneous LLM requests allowed
    async_mode: threaded # or asyncio
    retry_strategy: native
    max_retries: 10
    tokens_per_minute: auto # set to null to disable rate limiting
    requests_per_minute: auto # set to null to disable rate limiting

### Input settings ###

input:
  type: file # or blob
  file_type: text # [csv, text, json]
  base_dir: "input"

chunks:
  size: 1200
  overlap: 500
  group_by_columns: [id]

### Output/storage settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

output:
  type: file # [file, blob, cosmosdb]
  base_dir: "output"

cache:
  type: file # [file, blob, cosmosdb]
  base_dir: "cache"

reporting:
  type: file # [file, blob, cosmosdb]
  base_dir: "logs"

vector_store:
  default_vector_store:
    type: lancedb
    db_uri: output/lancedb
    container_name: default
    overwrite: True

### Workflow settings ###

embed_text:
  model_id: default_embedding_model
  vector_store_id: default_vector_store

extract_graph:
  model_id: default_chat_model
  prompt: "prompts/extract_graph_v3.txt"
  entity_types: [COMPANY, SUBSIDIARY, CURRENCY, FISCAL_YEAR, FISCAL_QUARTER, METRIC_CATEGORY, METRIC_VALUE, PRODUCT/SERVICE, RISK, GEOGRAPHY, PERSON, TITLES_EXTENDED, CORPORATE_ACTIONS, COMPANY_EVENTS]
  max_gleanings: 1

summarize_descriptions:
  model_id: default_chat_model
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

extract_graph_nlp:
  text_analyzer:
    extractor_type: regex_english # [regex_english, syntactic_parser, cfg]

cluster_graph:
  max_cluster_size: 10

extract_claims:
  enabled: false
  model_id: default_chat_model
  prompt: "prompts/extract_claims.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  model_id: default_chat_model
  graph_prompt: "prompts/community_report_graph.txt"
  text_prompt: "prompts/community_report_text.txt"
  max_length: 2000
  max_input_length: 8000

embed_graph:
  enabled: true # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes (embed_graph must also be enabled)

snapshots:
  graphml: true
  embeddings: true

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  chat_model_id: default_chat_model
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/drift_search_system_prompt.txt"
  reduce_prompt: "prompts/drift_search_reduce_prompt.txt"

basic_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/basic_search_system_prompt.txt"
```
-
Hi @fenglex, thanks for raising this. 10+ hours for indexing 19 articles definitely seems excessive. A few things to consider: DeepSeek-R1 via Alibaba Cloud might have rate limits, token quotas, or cold-start latency depending on your plan, and if you're using a retrieval-based pipeline (like RAG), the slowness might come from embedding generation or vector indexing (e.g., FAISS, Milvus).

🔧 Suggestions:

- See if you can batch the requests (if the API supports it).
- Run the process with logging enabled to identify the exact bottleneck (embedding, indexing, or disk I/O).
- Check whether the articles are unusually large or contain formatting that might slow tokenization.

If you can share your code snippet or indexing setup, I'd be happy to take a closer look! Best,
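For the logging/bottleneck suggestion, here is a minimal timing sketch (not from the original reply; the endpoint and model names are assumed to come from the same environment variables referenced in the settings.yaml above). It times one chat completion and one embedding request so you can see which side dominates before launching a full index run; reasoning models such as deepseek-r1 emit reasoning tokens on every request, so a single slow chat call here already explains a lot of indexing time.

```python
import os
import time

from openai import OpenAI  # pip install openai

# Reuses the same endpoint and credentials that settings.yaml points at.
client = OpenAI(
    base_url=os.environ["OPENAI_BASE_URL"],
    api_key=os.environ["GRAPHRAG_API_KEY"],
)

sample = "GraphRAG indexing throughput test. " * 60  # a few hundred tokens of filler text

# One extraction-like chat call.
t0 = time.perf_counter()
client.chat.completions.create(
    model=os.environ["GRAPHRAG_CHAT_MODEL"],
    messages=[{"role": "user", "content": f"List the named entities in: {sample}"}],
)
print(f"chat completion: {time.perf_counter() - t0:.1f}s")

# One embedding call for the same text.
t0 = time.perf_counter()
client.embeddings.create(
    model=os.environ["GRAPHRAG_EMBEDDINGS_MODEL"],
    input=sample,
)
print(f"embedding:       {time.perf_counter() - t0:.1f}s")
```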
-
Hi everyone, I've implemented several changes that have improved indexing performance by approximately 50%. First, I disabled the pipeline cache in settings.yaml:

```yaml
cache:
  type: none # Supported: file, blob, cosmosdb
```

Second, I simplified how the input file is resolved, replacing the regex scan of storage with a direct reference to the known file:

```python
# Original implementation:
# files = list(storage.find(re.compile(config.file_pattern), progress=progress, file_filter=config.file_filter))
# Example: config.file_pattern = oid/file-name$

# Optimized implementation:
files = [(config.file_pattern.split("/")[-1].split("$")[0], {})]
```

Impact:
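To make the intent of that file-discovery change concrete, here is a self-contained illustration (directory name, pattern, and file name are made up for the example): the original path enumerates everything in storage and regex-filters it, which scales with the number of stored objects, while the optimized path simply addresses the one file that is already known.

```python
import re
import time
from pathlib import Path

input_dir = Path("input")               # hypothetical input folder
file_pattern = re.compile(r".*\.txt$")  # hypothetical config.file_pattern

# 1) Scan-and-filter: cost grows with the number of objects in storage.
t0 = time.perf_counter()
matched = []
if input_dir.exists():
    matched = [p.name for p in input_dir.rglob("*") if file_pattern.match(p.name)]
print(f"scan-and-filter: {len(matched)} file(s) in {time.perf_counter() - t0:.4f}s")

# 2) Direct lookup: constant cost when the target file is already known.
t0 = time.perf_counter()
files = [("my-article.txt", {})]        # mirrors files = [(..., {})] in the snippet above
print(f"direct lookup:   {len(files)} file(s) in {time.perf_counter() - t0:.4f}s")
```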
-
The indexing process is taking too long. I used 19 articles, and it still hadn't completed after 10 hours, using Alibaba Cloud's API (deepseek-r1).