feat(semsearch): example semantic search

Daniele Briggi · Daniele Briggi · commit 3b30bcd684c9 · 2025-07-03T11:12:56.000+02:00
diff --git a/examples/semantic_search/README.md b/examples/semantic_search/README.md
@@ -0,0 +1,49 @@
+## sqlite-vector Semantic Search Example
+
+This Python example demonstrates how to build a semantic search engine using the [sqlite-vector](https://github.com/sqliteai/sqlite-vector) extension and a Sentence Transformer model. It lets you index and search documents by vector similarity, using an embedding model that runs locally.
+
+### How it works
+
+- **Embeddings**: Uses [sentence-transformers](https://huggingface.co/sentence-transformers) to generate dense vector representations (embeddings) for text. The default model is [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), a fast, lightweight model (384 dimensions) suitable for semantic search and retrieval tasks.
+- **Vector Store and Search**: Embeddings are stored in SQLite using the [`sqlite-vector`](https://github.com/sqliteai/sqlite-vector) extension, enabling fast similarity search (cosine distance) directly in the database.
+- **Sample Data**: The `samples/` directory contains example documents you can index and search immediately.
+
+### Installation
+
+```bash
+$ python -m venv venv
+
+$ source venv/bin/activate
+
+$ pip install -r requirements.txt
+```
+
+On first use, the required model will be downloaded automatically.
+
+### Usage
+
+Use the interactive mode to keep the model in memory and run multiple queries efficiently:
+
+```bash
+python semsearch.py --repl
+
+# Index a directory of documents
+semsearch> index ./samples
+
+# Search for similar documents
+semsearch> search "neural network architectures for image recognition"
+```
+
+### Example Queries
+
+Try these queries to test semantic similarity:
+
+- "neural network architectures for image recognition"
+- "reinforcement learning in autonomous systems"
+- "explainable artificial intelligence methods"
+- "AI governance and regulatory compliance"
+- "network intrusion detection systems"
+
+**Note:**
+- Supported extensions are `.md`, `.txt`, `.py`, `.js`, `.html`, `.css`, `.sql`, `.json`, `.xml`.
+- For more details, see the code in `semsearch.py` and `semantic_search.py`.
diff --git a/examples/semantic_search/requirements.txt b/examples/semantic_search/requirements.txt
@@ -0,0 +1 @@
+sentence-transformers
diff --git a/examples/semantic_search/samples/sample-1.md b/examples/semantic_search/samples/sample-1.md
@@ -0,0 +1,3 @@
+# Article 1: Deep Learning Neural Networks
+
+Deep learning utilizes artificial neural networks with multiple layers to process and learn from vast amounts of data. These networks automatically discover intricate patterns and representations without manual feature engineering. Convolutional neural networks excel at image recognition tasks, while recurrent neural networks handle sequential data like text and speech. Popular frameworks include TensorFlow, PyTorch, and Keras. Deep learning has revolutionized computer vision, natural language processing, and speech recognition applications.
diff --git a/examples/semantic_search/samples/sample-10.md b/examples/semantic_search/samples/sample-10.md
@@ -0,0 +1,3 @@
+# Article 10: Zero Trust Security Architecture
+
+Zero trust security operates on the principle of "never trust, always verify," requiring authentication and authorization for every access request regardless of location. This approach assumes breach scenarios and implements continuous verification throughout the network. Key components include identity verification, device compliance checking, least privilege access, and micro-segmentation. Zero trust frameworks help organizations protect against insider threats and advanced persistent attacks.
diff --git a/examples/semantic_search/samples/sample-11.md b/examples/semantic_search/samples/sample-11.md
@@ -0,0 +1,3 @@
+# Article 11: Incident Response and Recovery
+
+Effective incident response requires predefined procedures for detecting, containing, and recovering from security breaches. Response teams follow structured phases: preparation, identification, containment, eradication, recovery, and lessons learned. Critical activities include forensic analysis, stakeholder communication, system restoration, and process improvement. Regular tabletop exercises and response plan updates ensure organizations can quickly minimize damage and restore normal operations after security incidents.
diff --git a/examples/semantic_search/samples/sample-12.md b/examples/semantic_search/samples/sample-12.md
@@ -0,0 +1,3 @@
+# Article 12: Machine Learning for Malware Detection
+
+Machine learning enhances malware detection by analyzing file characteristics, behavioral patterns, and network communications to identify threats. Static analysis examines file properties without execution, while dynamic analysis observes runtime behavior in controlled environments. Ensemble methods combining multiple algorithms improve detection accuracy and reduce false positives. AI-powered systems can identify zero-day threats and polymorphic malware that traditional signature-based solutions miss.
diff --git a/examples/semantic_search/samples/sample-13.md b/examples/semantic_search/samples/sample-13.md
@@ -0,0 +1,3 @@
+# Article 13: Behavioral Analytics for Anomaly Detection
+
+Behavioral analytics leverages machine learning to establish baseline patterns of normal user and system behavior, flagging deviations that may indicate security threats. User and entity behavior analytics (UEBA) systems monitor login patterns, data access, and application usage to detect insider threats and compromised accounts. Machine learning models adapt to changing behavior patterns while maintaining sensitivity to subtle anomalies that human analysts might overlook.
diff --git a/examples/semantic_search/samples/sample-14.md b/examples/semantic_search/samples/sample-14.md
@@ -0,0 +1,3 @@
+# Article 14: AI-Driven Security Orchestration
+
+Security orchestration platforms integrate multiple security tools and automate incident response workflows using artificial intelligence. These systems correlate alerts from various sources, prioritize threats based on risk assessment, and execute automated remediation actions. Natural language processing helps analyze threat intelligence reports, while machine learning improves decision-making accuracy over time. Orchestration reduces response times and analyst workload while maintaining consistent security procedures.
diff --git a/examples/semantic_search/samples/sample-15.md b/examples/semantic_search/samples/sample-15.md
@@ -0,0 +1,3 @@
+# Article 15: Advanced Persistent Threats (APTs)
+
+Advanced persistent threats represent sophisticated, long-term cyberattacks typically conducted by nation-states or organized criminal groups. APTs use multiple attack vectors, maintain persistent access, and employ stealth techniques to avoid detection. Common tactics include spear-phishing, zero-day exploits, living-off-the-land techniques, and lateral movement within networks. Defense requires continuous monitoring, threat hunting, and intelligence-driven security strategies to detect and neutralize these patient adversaries.
diff --git a/examples/semantic_search/samples/sample-16.md b/examples/semantic_search/samples/sample-16.md
@@ -0,0 +1,3 @@
+# Article 16: Social Engineering Attack Vectors
+
+Social engineering exploits human psychology rather than technical vulnerabilities to gain unauthorized access to systems and information. Common techniques include phishing emails, pretexting phone calls, baiting with infected media, and physical tailgating. Attackers research targets through social media and public information to craft convincing scenarios. Defense requires security awareness training, verification procedures, and creating organizational cultures that encourage reporting suspicious communications.
diff --git a/examples/semantic_search/samples/sample-17.md b/examples/semantic_search/samples/sample-17.md
@@ -0,0 +1,3 @@
+# Article 17: Supply Chain Security Risks
+
+Supply chain attacks target third-party vendors and software dependencies to compromise multiple organizations simultaneously. Attackers may insert malicious code into legitimate software updates, compromise hardware during manufacturing, or exploit trusted vendor relationships. Notable incidents include SolarWinds and Kaseya attacks affecting thousands of organizations. Mitigation strategies include vendor risk assessment, software composition analysis, and zero-trust principles for third-party integrations.
diff --git a/examples/semantic_search/samples/sample-18.md b/examples/semantic_search/samples/sample-18.md
@@ -0,0 +1,3 @@
+# Article 18: Quantum Computing and Cryptography
+
+Quantum computing poses both opportunities and threats for cybersecurity. Quantum computers could break current cryptographic algorithms like RSA and ECC that secure internet communications and data protection. Organizations must prepare for post-quantum cryptography by implementing quantum-resistant algorithms. However, quantum technologies also enable quantum key distribution for theoretically unbreakable communication channels. The transition period requires careful planning and gradual migration strategies.
diff --git a/examples/semantic_search/samples/sample-19.md b/examples/semantic_search/samples/sample-19.md
@@ -0,0 +1,3 @@
+# Article 19: Edge Computing Security Challenges
+
+Edge computing brings data processing closer to end users and devices, improving performance but creating new security challenges. Distributed edge nodes have limited security controls compared to centralized data centers. Attack surfaces expand across numerous endpoints with varying security capabilities. Key concerns include device authentication, data encryption, secure updates, and centralized security management. Zero-trust architectures and hardware-based security become essential for edge deployments.
diff --git a/examples/semantic_search/samples/sample-2.md b/examples/semantic_search/samples/sample-2.md
@@ -0,0 +1,3 @@
+# Article 2: Natural Language Processing Fundamentals
+
+Natural language processing enables computers to understand, interpret, and generate human language. Key techniques include tokenization, part-of-speech tagging, named entity recognition, and sentiment analysis. Modern NLP leverages transformer architectures like BERT and GPT models for tasks such as language translation, text summarization, and question answering. Applications span chatbots, voice assistants, content moderation, and automated document analysis across various industries.
diff --git a/examples/semantic_search/samples/sample-20.md b/examples/semantic_search/samples/sample-20.md
@@ -0,0 +1,3 @@
+# Article 20: IoT Security Vulnerabilities
+
+Internet of Things devices often have weak security controls due to cost constraints and rapid deployment cycles. Common vulnerabilities include default passwords, unencrypted communications, lack of update mechanisms, and insufficient access controls. IoT botnets can launch massive distributed denial-of-service attacks. Security strategies include network segmentation, device lifecycle management, security-by-design principles, and regulatory compliance requirements for IoT manufacturers and deployments.
diff --git a/examples/semantic_search/samples/sample-3.md b/examples/semantic_search/samples/sample-3.md
@@ -0,0 +1,3 @@
+# Article 3: Computer Vision Applications
+
+Computer vision empowers machines to interpret and analyze visual information from images and videos. Core techniques include object detection, image classification, facial recognition, and motion tracking. Convolutional neural networks form the backbone of modern computer vision systems. Applications include autonomous vehicles, medical imaging diagnosis, quality control in manufacturing, augmented reality, and surveillance systems. Edge computing enables real-time computer vision processing on mobile devices.
diff --git a/examples/semantic_search/samples/sample-4.md b/examples/semantic_search/samples/sample-4.md
@@ -0,0 +1,3 @@
+# Article 4: Reinforcement Learning Algorithms
+
+Reinforcement learning trains agents to make optimal decisions through trial and error interactions with environments. Agents receive rewards or penalties based on their actions, gradually learning policies that maximize cumulative rewards. Q-learning and policy gradient methods are fundamental approaches. Applications include game playing (AlphaGo), robotics control, autonomous driving, recommendation systems, and financial trading algorithms. The exploration-exploitation trade-off remains a central challenge.
diff --git a/examples/semantic_search/samples/sample-5.md b/examples/semantic_search/samples/sample-5.md
@@ -0,0 +1,3 @@
+# Article 5: Supervised vs Unsupervised Learning
+
+Supervised learning uses labeled training data to predict outcomes for new inputs, including classification and regression tasks. Common algorithms include decision trees, support vector machines, and random forests. Unsupervised learning discovers hidden patterns in unlabeled data through clustering, dimensionality reduction, and association rules. Semi-supervised learning combines both approaches when labeled data is scarce. Each paradigm serves different problem types and data availability scenarios.
diff --git a/examples/semantic_search/samples/sample-6.md b/examples/semantic_search/samples/sample-6.md
@@ -0,0 +1,3 @@
+# Article 6: AI Ethics and Bias Mitigation
+
+Artificial intelligence systems can perpetuate or amplify human biases present in training data, leading to unfair outcomes across different demographic groups. Bias mitigation strategies include diverse dataset collection, algorithmic fairness constraints, and regular bias auditing. Ethical AI development requires transparency, accountability, and stakeholder involvement. Organizations must establish governance frameworks addressing privacy, consent, and algorithmic decision-making impacts on individuals and society.
diff --git a/examples/semantic_search/samples/sample-7.md b/examples/semantic_search/samples/sample-7.md
@@ -0,0 +1,3 @@
+# Article 7: Explainable AI and Interpretability
+
+Explainable AI focuses on making machine learning models more transparent and interpretable to human users. Black-box models like deep neural networks often lack interpretability, creating trust and accountability issues. Techniques include feature importance analysis, LIME (Local Interpretable Model-agnostic Explanations), and SHAP (SHapley Additive exPlanations). Interpretability is crucial for high-stakes applications like healthcare, finance, and criminal justice where decisions require justification.
diff --git a/examples/semantic_search/samples/sample-8.md b/examples/semantic_search/samples/sample-8.md
@@ -0,0 +1,3 @@
+# Article 8: AI Regulation and Compliance
+
+Governments worldwide are developing regulatory frameworks for artificial intelligence deployment and development. The European Union's AI Act categorizes AI systems by risk levels, imposing strict requirements for high-risk applications. Compliance involves documentation, risk assessment, human oversight, and algorithmic auditing. Organizations must navigate evolving regulations while maintaining innovation capabilities. Privacy laws like GDPR also impact AI data processing and automated decision-making systems.
diff --git a/examples/semantic_search/samples/sample-9.md b/examples/semantic_search/samples/sample-9.md
@@ -0,0 +1,3 @@
+# Article 9: Threat Detection and Prevention
+
+Cybersecurity threat detection employs various technologies to identify malicious activities before they cause damage. Intrusion detection systems monitor network traffic for suspicious patterns, while endpoint protection software guards individual devices. Behavioral analysis identifies anomalies in user activities that may indicate compromised accounts. Security information and event management (SIEM) platforms aggregate and analyze security logs from multiple sources to provide comprehensive threat visibility.
diff --git a/examples/semantic_search/semantic_search.py b/examples/semantic_search/semantic_search.py
diff --git a/examples/semantic_search/semsearch.py b/examples/semantic_search/semsearch.py
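
The README in this commit describes ranking documents by cosine distance between embeddings. As a rough illustration of that idea only, the sketch below ranks a few documents in pure Python. It does not use sqlite-vector or sentence-transformers, and it is not the code from `semantic_search.py`: the function names `cosine_distance` and `rank` are hypothetical, and the 3-dimensional vectors are made-up stand-ins for the 384-dimensional embeddings the model would produce.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank(query_vec, docs):
    """Return (name, distance) pairs sorted by ascending distance to the query."""
    scored = ((name, cosine_distance(query_vec, vec)) for name, vec in docs)
    return sorted(scored, key=lambda pair: pair[1])


# Assumption: toy vectors invented for illustration, not real model output.
docs = [
    ("sample-1.md", [0.9, 0.1, 0.0]),
    ("sample-9.md", [0.1, 0.9, 0.2]),
    ("sample-8.md", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.2, 0.1]

results = rank(query, docs)
for name, dist in results:
    # Smaller distance means the document is closer to the query.
    print(f"{name}\t{dist:.3f}")
```

In the actual example the distance computation is delegated to the database extension, so a loop like this would only be useful for checking intuition on small inputs.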
