Commit fee4eeb

Add web crawling (#4)
* Initial implementation of web-crawling. Each file found is ingested.
* Leverages crawler4j dependency
* Restore OpenAI interactions that rely on embedding model
* Upgrades to docker-compose infra.
* Add crawl function to ENDPOINTS.md and Excalidraw-based diagram
* Other miscellaneous adjustments
1 parent 5ea12f9 commit fee4eeb

23 files changed, with 855 additions and 289 deletions

build.gradle

Lines changed: 7 additions & 1 deletion
@@ -49,12 +49,18 @@ dependencies {
     }
     implementation 'io.minio:minio:8.5.12'
     implementation 'com.emc.ecs:object-client:3.4.10'
-    implementation 'io.pivotal.cfenv:java-cfenv-all:3.2.0'
+    implementation 'io.pivotal.cfenv:java-cfenv:3.2.0'
+    implementation 'io.pivotal.cfenv:java-cfenv-boot:3.2.0'
+    implementation 'com.cedarsoftware:json-io:4.30.0'
     implementation 'org.springframework.cloud:spring-cloud-bindings:2.0.3'
     implementation 'org.springframework.ai:spring-ai-markdown-document-reader'
     implementation 'org.springframework.ai:spring-ai-tika-document-reader'
     implementation 'org.springframework.ai:spring-ai-pdf-document-reader'
     implementation 'com.fasterxml.jackson.dataformat:jackson-dataformat-xml'
+    implementation 'io.github.springboot-addons:spring-boot-starter-httpclient5-resilience4j:1.0.5'
+    implementation ('de.hs-heilbronn.mi:crawler4j-with-hsqldb:5.1.0') {
+        exclude group: 'org.apache.logging.log4j', module: 'log4j-slf4j2-impl'
+    }
     runtimeOnly 'io.micrometer:micrometer-registry-prometheus'
     developmentOnly 'org.springframework.boot:spring-boot-docker-compose'
     testImplementation 'org.springframework.boot:spring-boot-starter-test'

docker-compose.chroma.yml

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ include:
 
 services:
   chroma:
-    image: chromadb/chroma:0.5.5
+    image: chromadb/chroma:0.5.15
     environment:
       - IS_PERSISTENT=TRUE
     volumes:

docker-compose.minio.yml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 services:
   minio:
-    image: minio/minio:RELEASE.2024-09-22T00-33-43Z.fips
+    image: minio/minio:RELEASE.2024-10-13T13-34-11Z.fips
     ports:
       - 9000:9000
       - 9001:9001

docker-compose.observability.yml

Lines changed: 2 additions & 2 deletions
@@ -7,7 +7,7 @@ services:
     ports:
       - 9411:9411
   prometheus:
-    image: prom/prometheus:v2.54.1
+    image: prom/prometheus:v2.55.0
     container_name: prometheus
     volumes:
       - "./prometheus.yml:/etc/prometheus/prometheus.yml"
@@ -16,7 +16,7 @@ services:
     ports:
       - 9090:9090
   grafana:
-    image: grafana/grafana:11.2.0
+    image: grafana/grafana:11.3.0
     container_name: grafana
     restart: unless-stopped
     ports:

docker-compose.redis.yml

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ include:
 
 services:
   redis:
-    image: redis/redis-stack-server:7.4.0-v0
+    image: redis/redis-stack-server:7.4.0-v1
     container_name: redis
     ports:
       - 6379:6379

docs/ENDPOINTS.md

Lines changed: 29 additions & 3 deletions
@@ -2,6 +2,7 @@
 
 * [Endpoints](#endpoints)
   * [Upload](#upload)
+  * [Crawl](#crawl)
   * [Chat](#chat)
   * [Get Metadata](#get-metadata)
   * [Search](#search)
@@ -11,11 +12,11 @@
 
 ## Endpoints
 
-All endpoints below are to be prefixed with `/api/files`.
+All endpoints with exception to `/crawl` below are to be prefixed with `/api/files`.
 
 ### Upload
 
-Upload a file to an S3-compliant object store's bucket
+Upload a file to an S3-compliant object store's bucket; in addition the contents of the file will be vectorized and persisted into a [Vector Database](https://docs.spring.io/spring-ai/reference/api/vectordbs.html) provider.
 
 ```python
 POST /upload
@@ -56,6 +57,32 @@ You will need to adjust startup arguments, e.g., you could add the following to
 -Dspring.servlet.multipart.max-file-size=250MB
 ```
 
+### Crawl
+
+Facilitate web-crawling. Users may issue a [CrawlRequest](../src/main/java/org/cftoolsuite/domain/crawl/CrawlRequest.java) and each document found will be (a) uploaded to S3-compliant object store and (b) contents will be vectorized and persisted into a Vector Database provider.
+
+```python
+POST /crawl
+```
+
+**Sample interaction**
+
+```bash
+❯ http POST :8080/crawl rootDomain="https://docs.vmware.com/en/VMware-Tanzu-Platform/SaaS/" seeds:='["https://docs.vmware.com/en/VMware-Tanzu-Platform/SaaS/create-manage-apps-tanzu-platform-k8s/"]' maxDepthOfCrawling=5
+HTTP/1.1 202
+Connection: keep-alive
+Content-Type: application/json
+Date: Thu, 24 Oct 2024 13:38:24 GMT
+Keep-Alive: timeout=60
+Transfer-Encoding: chunked
+
+{
+    "id": "1",
+    "result": "Accepted",
+    "storageFolder": "/tmp/crawler4j/1"
+}
+```
+
 ### Chat
 
 Converse with an AI chatbot who is aware of all uploaded content. Ask a question, get a response.
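
The Crawl sample in the hunk above posts three fields (`rootDomain`, `seeds`, `maxDepthOfCrawling`) to `/crawl`. As a rough illustration for Java callers, the sketch below builds an equivalent request with the JDK's `java.net.http` API. The class name `CrawlRequestSketch` and its nested record are hypothetical stand-ins, not the project's `CrawlRequest` class, whose actual shape is not shown in this diff. The sketch also assumes `maxDepthOfCrawling` is numeric (the httpie sample sends it with `=`, i.e. as a string) and hand-rolls the JSON to stay dependency-free.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;
import java.util.stream.Collectors;

public class CrawlRequestSketch {

    // Hypothetical mirror of the fields seen in the sample interaction;
    // the project's real CrawlRequest may declare different types or extra fields.
    public record CrawlRequest(String rootDomain, List<String> seeds, int maxDepthOfCrawling) {

        // Serialize by hand so the sketch needs no JSON library.
        public String toJson() {
            String seedArray = seeds.stream()
                .map(s -> "\"" + escape(s) + "\"")
                .collect(Collectors.joining(",", "[", "]"));
            return "{\"rootDomain\":\"" + escape(rootDomain) + "\","
                + "\"seeds\":" + seedArray + ","
                + "\"maxDepthOfCrawling\":" + maxDepthOfCrawling + "}";
        }

        // Minimal escaping: backslashes and double quotes only.
        private static String escape(String s) {
            return s.replace("\\", "\\\\").replace("\"", "\\\"");
        }
    }

    // Build (but do not send) a POST to <baseUrl>/crawl with a JSON body.
    public static HttpRequest buildRequest(String baseUrl, CrawlRequest crawl) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/crawl"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(crawl.toJson()))
            .build();
    }

    public static void main(String[] args) {
        CrawlRequest crawl = new CrawlRequest(
            "https://docs.vmware.com/en/VMware-Tanzu-Platform/SaaS/",
            List.of("https://docs.vmware.com/en/VMware-Tanzu-Platform/SaaS/create-manage-apps-tanzu-platform-k8s/"),
            5);
        // Base URL assumed from the ":8080" shorthand in the httpie sample.
        HttpRequest request = buildRequest("http://localhost:8080", crawl);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(crawl.toJson());
    }
}
```

To actually submit the request, pass it to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. Going by the sample interaction, a `202` status with an `id`, `result`, and `storageFolder` in the body would indicate the crawl was accepted.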
@@ -80,7 +107,6 @@ Hermia is a key character in "A Midsummer Night's Dream," portrayed as the daugh
 She passionately defends her love for Lysander and is determined to be with him despite the obstacles posed by her father and societal expectations. Hermia's character embodies themes of love, rebellion, and the struggle for autonomy within the constraints of Athenian law. Her determination to follow her heart leads her to plan an escape with Lysander, showcasing her bravery and commitment to love (Act I).
 ```
 
-
 ### Get Metadata
 
 Retrieve metadata for all files in an S3-compliant object store's bucket
