Commit b4f7f3f

initial publish
1 parent 7ee1b04 commit b4f7f3f

678 files changed

+80733
-44
lines changed

.gitignore

Lines changed: 88 additions & 2 deletions
@@ -1,7 +1,10 @@
 ## Ignore Visual Studio temporary files, build results, and
 ## files generated by popular Visual Studio add-ons.
 ##
-## Get latest from https://github.com/github/gitignore/blob/main/VisualStudio.gitignore
+## Get latest from `dotnet new gitignore`
+
+# dotenv files
+.env
 
 # User-specific files
 *.rsuser
@@ -57,11 +60,14 @@ dlldata.c
 # Benchmark Results
 BenchmarkDotNet.Artifacts/
 
-# .NET Core
+# .NET
 project.lock.json
 project.fragment.lock.json
 artifacts/
 
+# Tye
+.tye/
+
 # ASP.NET Scaffolding
 ScaffoldingReadMe.txt
 
@@ -396,3 +402,83 @@ FodyWeavers.xsd
 
 # JetBrains Rider
 *.sln.iml
+.idea/
+
+##
+## Visual studio for Mac
+##
+
+
+# globs
+Makefile.in
+*.userprefs
+*.usertasks
+config.make
+config.status
+aclocal.m4
+install-sh
+autom4te.cache/
+*.tar.gz
+tarballs/
+test-results/
+
+# Mac bundle stuff
+*.dmg
+*.app
+
+# content below from: https://github.com/github/gitignore/blob/main/Global/macOS.gitignore
+# General
+.DS_Store
+.AppleDouble
+.LSOverride
+
+# Icon must end with two \r
+Icon
+
+
+# Thumbnails
+._*
+
+# Files that might appear in the root of a volume
+.DocumentRevisions-V100
+.fseventsd
+.Spotlight-V100
+.TemporaryItems
+.Trashes
+.VolumeIcon.icns
+.com.apple.timemachine.donotpresent
+
+# Directories potentially created on remote AFP share
+.AppleDB
+.AppleDesktop
+Network Trash Folder
+Temporary Items
+.apdisk
+
+# content below from: https://github.com/github/gitignore/blob/main/Global/Windows.gitignore
+# Windows thumbnail cache files
+Thumbs.db
+ehthumbs.db
+ehthumbs_vista.db
+
+# Dump file
+*.stackdump
+
+# Folder config file
+[Dd]esktop.ini
+
+# Recycle Bin used on file shares
+$RECYCLE.BIN/
+
+# Windows Installer files
+*.cab
+*.msi
+*.msix
+*.msm
+*.msp
+
+# Windows shortcuts
+*.lnk
+
+# Vim temporary swap files
+*.swp

App/Data_Processing.md

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
+## Content Processing
+Additional details about how content processing is handled in the solution. This includes the workflow steps and how to use your own data in the solution.
+
+### Workflow
+
+1. <u>Document upload</u><br/>
+Documents are added to blob storage. Processing is triggered based on file check-in.
+
+2. <u>Text extraction, context extraction (image)</u><br/>
+Based on file type, an appropriate processing pipeline is used.
+
+3. <u>Summarization</u><br/>
+LLM summarization of the extracted content.
+
+4. <u>Keyword and entity extraction</u><br/>
+Keywords are extracted from the full document through an LLM prompt. If the document is too large, keywords are extracted from the summarization.
+
+5. <u>Text chunking from text extraction results</u><br/>
+Chunking size is aligned with the embedding model size.
+
+6. <u>Vectorization</u><br/>
+Creation of embeddings from chunked text using the text-embedding-3-large model.
+
+7. <u>Save results to Azure AI Search index</u><br/>
+Data is added to the index, including vectorized fields, text chunks, keywords, and entity-specific metadata.
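
As a rough illustration of step 5, the sketch below splits extracted text into chunks that fit a token budget. It is not the accelerator's implementation: the function name `chunk_text`, the whitespace split standing in for a real tokenizer, and the `max_tokens` and `overlap` values are all assumptions made for illustration.

```python
def chunk_text(text: str, max_tokens: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most max_tokens whitespace-separated words.

    Assumption: words stand in for model tokens. A real pipeline would count
    tokens with the embedding model's own tokenizer.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")

    words = text.split()
    step = max_tokens - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

With `max_tokens=3`, the text `"a b c d e f g"` yields three chunks; adding `overlap=1` makes each chunk repeat the last word of the previous one, which can help preserve context across chunk boundaries.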
+
+### Customizing With Your Own Documents
+
+There are two methods to use your own data in this solution. It takes roughly 10-15 minutes for a file to be processed and show up in the index and in results on the web app.
+
+1. <u>Web App - UI Uploading</u><br/>
+You can upload files that you would like processed through the user interface. These files are uploaded to blob storage, processed, and added to the Azure AI Search index. File uploads are limited to 500MB and restricted to the following file formats: Office Files, TXT, PDF, TIFF, JPG, PNG.
+
+2. <u>Bulk File Processing</u><br/>
+You can take advantage of bulk file processing, since the web app also saves uploaded files here. This is the ideal way to upload a large number of documents or files that are large in size.
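
The upload limits described for the web app can be expressed as a small pre-upload check. This is a minimal sketch, not code from the solution: `validate_upload` is a hypothetical helper, the extension list is an assumed reading of "Office Files, TXT, PDF, TIFF, JPG, PNG", and 500MB is assumed to mean 500 × 1024 × 1024 bytes.

```python
from pathlib import Path

# Assumption: 1 MB is treated as 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 500 * 1024 * 1024

# Assumption: "Office Files" is read as the common Word, Excel, and
# PowerPoint extensions; the source does not list exact extensions.
ALLOWED_EXTENSIONS = {
    ".docx", ".xlsx", ".pptx",
    ".txt", ".pdf",
    ".tif", ".tiff",
    ".jpg", ".jpeg", ".png",
}


def validate_upload(filename: str, size_bytes: int) -> tuple[bool, str]:
    """Return (accepted, reason) for a candidate upload."""
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        return False, f"unsupported file type: {extension or '(none)'}"
    if size_bytes > MAX_UPLOAD_BYTES:
        return False, "file exceeds the 500 MB limit"
    return True, "ok"
```

Checking the extension before the size means an unsupported file is rejected with the more specific reason even when it is also too large.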
+
+### Modifying Processing Prompts
+
+Prompt-based processing is used for context extraction, summarization, and keyword/entity extraction. Modifications to the prompts will change what is extracted for the related workflow step.
+
+You can find the prompt configuration text files for **summarization** and **keyword/entity** extraction in this folder:
+```
+\App\kernel-memory\service\Core\Prompts\
+```
+
+**Context extraction** requires a code re-compile. You can modify the prompt in this code file on <u>line 56</u>:
+
+```
+\App\kernel-memory\service\Core\DataFormats\Image\ImageContextDecoder.cs
+```

App/Technical_Architecture.md

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
+## Technical Architecture
+
+Additional details about the technical architecture of the Document Knowledge Mining solution accelerator. This describes the purpose and additional context of each component in the solution.
+
+![image](../images/readme/architecture.png)
+
+
+### Ingress Controller
+Uses Azure's Application Gateway Ingress Controller for Kubernetes, allowing for load balancing and dynamic traffic management across the application layer.
+
+### Azure Kubernetes
+Using Azure Kubernetes Service, the application is deployed as a managed containerized app. This is ideal for deploying a high availability, scalable, and portable application to multiple regions.
+
+### Container Registry
+Using Azure Container Registry, container images are built, stored, and managed in a private registry. These container images include the Document Processor, AI Service, and Web App.
+
+### Web App
+Using Azure App Service, a web app acts as the UI for the solution. The app is built with React and TypeScript. It acts as an API client to create an experience for document search, an easy-to-use upload and processing interface, and an LLM-powered conversational user interface.
+
+### Service - Document Processor
+Internal Kubernetes cluster for document processing pods.
+
+### Document Processor Pods
+API endpoints to facilitate processing of documents that are stored in blob storage. Azure Kubernetes Pod that handles saving document chunks, vectors, and keywords to Azure AI Search and blob storage. It extracts content and context from images in order to derive knowledge, keywords, topics, and summarizations. Based on the file type, different processing pipelines are run to extract the data in the appropriate steps.
+
+### Service - AI Service
+Internal Kubernetes cluster for AI service pods.
+
+### AI Service Processor Pods
+Azure Kubernetes Pod that acts as the solution's orchestration layer (with Semantic Kernel) for interaction with the LLM for the web app. This also includes chat endpoints (synchronous and asynchronous) to stream chat conversations on the web app and to save chat history. This facilitates saving document metadata, keywords, and summarizations to Cosmos DB to show them through the web app's user interface.
+
+### App Configuration
+Using Azure App Configuration, app settings and configurations are centralized and used with the Document Processor Service, AI Service, and Web App.
+
+### Storage Queue
+Using Azure Storage Queue, pipeline work steps and processing jobs are added to the storage queue to be picked up and run for their respective jobs. Uploaded files are queued while being saved to blob storage and are removed from the queue after successful completion.
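
The queue lifecycle described above (enqueue a job, run it, remove it only after success) can be sketched with an in-memory stand-in. This is an assumption-laden illustration, not the solution's code: `drain_queue` and the `process` callback are hypothetical names, and a `collections.deque` replaces the real Azure Storage Queue client.

```python
from collections import deque
from typing import Callable


def drain_queue(queue: deque[str], process: Callable[[str], None]) -> list[str]:
    """Run every queued job once; keep failed jobs on the queue for a retry.

    Assumption: a job is just a string identifier (for example a blob name).
    """
    failed: list[str] = []
    while queue:
        job = queue.popleft()
        try:
            process(job)  # success: the job is not put back on the queue
        except Exception:
            failed.append(job)  # failure: remember the job for requeueing
    queue.extend(failed)  # failed jobs stay available for a later attempt
    return failed
```

A job that raises stays on the queue while successful jobs are dropped, which mirrors the "removed after successful completion" behavior in miniature.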
+
+### Azure AI Search
+Processed and extracted document information is added to an Azure AI Search vectorized index. This vectorized index includes columns relevant to the document set and is integrated with the web app to power the document search and document chatting experience.
+
+### Azure Document Intelligence
+One step of the data processing workflow where documents have Optical Character Recognition (OCR) applied to extract data. This includes text and handwriting extraction from documents.
+
+### GPT 4o mini
+Using Azure OpenAI, a deployment of the GPT 4o mini model (version 2024-07-18) is used during the data processing workflow to extract content, context, keywords, knowledge, topics, and summarization. This model is also used in the web app's chat experience. This model can be changed to a different Azure OpenAI model if desired, but this has not been thoroughly tested and may be affected by the output token limits.
+
+### Blob Storage
+Using Azure Blob Storage, unprocessed documents are stored as blobs. The data processing workflow reads the file and saves JSON, text chunks, markdown, embedded text, and metadata including keywords and summarization of the processed data back to blob storage. Files uploaded through the web app's upload capabilities are uploaded here.
+
+
+### Cosmos DB for MongoDB
+Using Azure Cosmos DB for MongoDB, documents that have been processed have their processing results saved to a table. The web app chat experience saves chat history to a table. The processed document results and chat history are used to inform prompt recommendations and answers.

App/backend-api/.dockerignore

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+**/.classpath
+**/.dockerignore
+**/.env
+**/.git
+**/.gitignore
+**/.project
+**/.settings
+**/.toolstarget
+**/.vs
+**/.vscode
+**/*.*proj.user
+**/*.dbmdl
+**/*.jfm
+**/azds.yaml
+**/bin
+**/charts
+**/docker-compose*
+**/Dockerfile*
+**/node_modules
+**/npm-debug.log
+**/obj
+**/secrets.dev.yaml
+**/values.dev.yaml
+LICENSE
+README.md
+!**/.gitignore
+!.git/HEAD
+!.git/config
+!.git/packed-refs
+!.git/refs/heads/**
