
Commit 9cdf202

feat: new tokenization & embeddings post and pinned posts

1 parent 70f3d45 commit 9cdf202

6 files changed: +432 -3 lines changed

aigc/app.js

Lines changed: 22 additions & 3 deletions
@@ -31,7 +31,8 @@ async function loadPosts() {
         // Try to fetch a list of posts - we'll use a fallback approach
         const postFiles = [
             'hallucination-mitigation',
-            'intro-to-aigc'
+            'intro-to-aigc',
+            'tokenization-embeddings'
         ];
 
         const posts = [];
@@ -51,6 +52,7 @@ const postFiles = [
                 date: metadata.date || 'Unknown Date',
                 excerpt: metadata.excerpt || 'No description available',
                 category: metadata.category || 'general',
+                pinned: metadata.pinned === 'true' || metadata.pinned === true,
                 filename: filename,
                 link: `./posts/${filename}/`
             });
@@ -60,6 +62,19 @@ const postFiles = [
             }
         }
 
+        // Sort posts: pinned first (by date), then unpinned (by date)
+        posts.sort((a, b) => {
+            // If one is pinned and one isn't, pinned comes first
+            if (a.pinned !== b.pinned) {
+                return a.pinned ? -1 : 1;
+            }
+
+            // Both pinned or both unpinned: sort by date (newest first)
+            const dateA = new Date(a.date);
+            const dateB = new Date(b.date);
+            return dateB - dateA;
+        });
+
         return posts;
     } catch (err) {
         console.error('Error loading posts:', err);
@@ -83,10 +98,14 @@ function formatDate(dateStr) {
 }
 
 function createPostCard(post) {
+    const pinnedBadge = post.pinned ? '<span class="text-xs text-[#ff00ff] bg-[#ff00ff]/10 px-3 py-1 rounded-full font-bold ml-2">📌 PINNED</span>' : '';
     return `
-        <article class="bg-gray-900/80 p-6 rounded-lg border-l-4 border-[#00ff88] hover:border-[#00ccff] shadow-lg shadow-blue-500/10 transition-all duration-300 transform hover:-translate-y-1">
+        <article class="bg-gray-900/80 p-6 rounded-lg border-l-4 ${post.pinned ? 'border-[#ff00ff]' : 'border-[#00ff88]'} hover:border-[#00ccff] shadow-lg shadow-blue-500/10 transition-all duration-300 transform hover:-translate-y-1">
             <div class="flex justify-between items-start mb-3">
-                <span class="text-xs text-[#00ccff] bg-[#00ccff]/10 px-3 py-1 rounded-full font-bold">${post.category}</span>
+                <div class="flex gap-2">
+                    <span class="text-xs text-[#00ccff] bg-[#00ccff]/10 px-3 py-1 rounded-full font-bold">${post.category}</span>
+                    ${pinnedBadge}
+                </div>
                 <time class="text-sm text-gray-500 font-mono">${formatDate(post.date)}</time>
             </div>
             <h2 class="text-xl font-semibold mb-3 text-[#f1c40f] hover:text-[#00ff88] transition">

aigc/posts/intro-to-aigc.md

Lines changed: 1 addition & 0 deletions
@@ -3,6 +3,7 @@ title: "Introduction to AI Generated Content (AIGC) Section"
 date: "2026-01-15"
 category: "general"
 excerpt: "To Log Random Musings with AI"
+pinned: true
 ---
 
 ## Why I Started This
aigc/posts/tokenization-embeddings.md

Lines changed: 84 additions & 0 deletions

@@ -0,0 +1,84 @@
---
title: "Tokenization & Embeddings"
date: "2026-01-18"
category: "tokenization-embeddings"
excerpt: "From Words to Vectors: Tokenization and Embeddings"
---

## From Words to Vectors: Tokenization and Embeddings

If LLMs are the "engine" of modern AI, then **Tokenization** and **Embeddings** are the fuel. Before a model can reason, summarize, or code, it must first translate human language into a language it understands: **high-dimensional mathematics.**

Understanding this bridge is crucial for anyone building AI agents, optimizing RAG pipelines, or managing API costs.

## 1. Tokenization: Breaking Language into Bricks

Tokenization is the process of chopping a string of text into smaller units called **tokens**. Think of tokens as the "atomic units" of processing.

### **The Three Levels of Tokenization**

1. **Word-level:** Splitting by spaces. (Simple, but fails on "running" vs "runner").
2. **Character-level:** Splitting every letter. (Too granular; the model loses context).
3. **Subword-level (The Standard):** Models like GPT-4 use **Byte Pair Encoding (BPE)**. It keeps common words as a single token (e.g., "apple") but splits rare words into pieces (e.g., "hallucination" becomes "hallucin" + "ation"); see the sketch after this list.
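
To make the subword idea concrete, here is a minimal, illustrative sketch: a greedy longest-match tokenizer over a made-up vocabulary. This is not GPT-4's actual BPE (which learns its merge table from a corpus), just the flavor of it.

```js
// Toy greedy longest-match subword tokenizer. The vocabulary is invented
// for illustration; a real BPE tokenizer learns its merges from data.
const vocab = new Set([
  'apple', 'run', 'ning', 'hallucin', 'ation',
  'a', 'c', 'e', 'g', 'h', 'i', 'l', 'n', 'o', 'p', 'r', 't', 'u',
]);

function tokenize(word) {
  const tokens = [];
  let i = 0;
  while (i < word.length) {
    // Take the longest vocabulary entry that matches at position i
    let j = word.length;
    while (j > i && !vocab.has(word.slice(i, j))) j--;
    if (j === i) j = i + 1; // unknown character: emit it as a single token
    tokens.push(word.slice(i, j));
    i = j;
  }
  return tokens;
}

console.log(tokenize('apple'));         // ['apple'] (common word: one token)
console.log(tokenize('hallucination')); // ['hallucin', 'ation'] (rare word: two pieces)
console.log(tokenize('running'));       // ['run', 'ning']
```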

### **Why It Matters**

* **The 75% Rule:** In English, 1,000 tokens are roughly equivalent to 750 words.
* **Context Windows:** Models have a "memory limit" (e.g., 128k tokens). If your tokenizer is inefficient, you hit that limit faster.
* **Cost:** You are billed by the token. Understanding how your text tokenizes helps you estimate spend and optimize prompts; a quick estimator follows this list.
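
As a back-of-envelope application of the 75% rule, here is a rough estimator. The price constant is a made-up placeholder, not any provider's actual rate, and real tokenizers vary by language and content (code tokenizes differently from prose), so treat the output strictly as an estimate.

```js
// Rough token/cost estimate from the ~75% rule (1,000 tokens ≈ 750 words).
// PRICE_PER_1K_TOKENS is a placeholder; check your provider's pricing page.
const PRICE_PER_1K_TOKENS = 0.002;

function estimateCost(text) {
  const words = text.trim().split(/\s+/).length;
  const tokens = Math.ceil(words / 0.75); // ~1 token per 0.75 words
  return { words, tokens, usd: (tokens / 1000) * PRICE_PER_1K_TOKENS };
}

console.log(estimateCost('How do I fix a leaky faucet?'));
// { words: 7, tokens: 10, usd: 0.00002 }
```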

## 2. Embeddings: Giving Words a "Map"

Once we have tokens, the model assigns each one a unique ID. But a list of IDs (e.g., `45, 102, 33`) doesn't tell the model that "dog" is related to "puppy."

This is where **Embeddings** come in. An embedding is a numerical representation of a token in a high-dimensional vector space.

### **The Semantic Space**

Imagine a 3D map where words with similar meanings are physically close to each other.

* "Apple" and "Banana" are close together.
* "Apple" and "Laptop" are slightly further apart (unless discussing tech).
* "Apple" and "Justice" are very far apart.

In reality, modern embeddings don't use 3 dimensions; they use **thousands** (e.g., 1,536 dimensions for OpenAI's `text-embedding-3-small`). Each dimension represents a "feature" of the word that the model learned during training.

### **The Magic of Vector Math**

Because these are numbers, we can perform math on them. The classic example:

`Vector("King") − Vector("Man") + Vector("Woman") ≈ Vector("Queen")`
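
You can check this intuition with a few lines of JavaScript. The 3-D vectors below are hand-picked toys (real embeddings have thousands of learned dimensions), but the arithmetic is exactly the same.

```js
// Hand-picked toy vectors; read the axes as [royalty, masculine, feminine].
const vec = {
  king:  [0.9, 0.8, 0.1],
  man:   [0.1, 0.9, 0.1],
  woman: [0.1, 0.1, 0.9],
  queen: [0.9, 0.1, 0.9],
};

const add = (a, b) => a.map((x, i) => x + b[i]);
const sub = (a, b) => a.map((x, i) => x - b[i]);
const dot = (a, b) => a.reduce((s, x, i) => s + x * b[i], 0);
const cosine = (a, b) => dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));

// King - Man + Woman lands very close to Queen
const result = add(sub(vec.king, vec.man), vec.woman);
console.log(cosine(result, vec.queen).toFixed(3)); // ≈ 0.997
```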

## 3. How They Work Together: The Pipeline

Here is the journey of a user query through an AI system (a toy trace of the first three steps follows the list):

1. **Input:** "How do I fix a leaky faucet?"
2. **Tokenization:** The string is split into tokens: `["How", " do", " I", " fix", " a", " leaky", " fauc", "et", "?"]`.
3. **Lookup:** The model looks up the **Embedding** for each token.
4. **Attention Layer:** The model looks at the vectors and realizes "leaky" is modifying "faucet," creating a combined understanding of the query.
5. **Output:** The model generates the next most likely token vector and turns it back into a word.
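
The first three steps are easy to demystify in code. Everything below (the ids, the 4-dimensional vectors) is invented for illustration; steps 4 and 5 happen inside the model's own layers.

```js
// Toy trace of steps 1-3: text -> tokens -> ids -> vectors.
const tokens = ['How', ' do', ' I', ' fix', ' a', ' leaky', ' fauc', 'et', '?'];

// Each distinct token gets an id (real vocabularies hold ~100k entries)
const vocabIds = new Map(tokens.map((t, i) => [t, i]));
const ids = tokens.map((t) => vocabIds.get(t));

// The embedding table maps each id to a learned vector (4-d here;
// real models use thousands of dimensions)
const embeddingTable = ids.map(() => Array.from({ length: 4 }, Math.random));
const vectors = ids.map((id) => embeddingTable[id]);

console.log(ids);               // [0, 1, 2, 3, 4, 5, 6, 7, 8]
console.log(vectors[0].length); // 4
```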

## 4. Practical Implementation: When to Care

If you are an AI architect, you will encounter these concepts in two main areas:

### **A. Choosing an Embedding Model**

Not all embeddings are equal. You need to balance **Performance vs. Latency**.

* **Proprietary (OpenAI/Gemini):** Extremely high performance, but you pay per request and data leaves your server.
* **Open Source (Qwen3/Gemma):** Can be hosted locally (good for privacy), but requires your own GPU infrastructure.

### **B. Vector Databases (The RAG Connection)**

When you build a Knowledge Base, you are essentially storing thousands of embeddings.

* **The Process:** You "embed" your entire document library.
* **The Search:** When a user asks a question, you embed the *question* and find the document vectors closest to it in the vector space, typically scored with **cosine similarity** (one of many possible similarity measures); a retrieval sketch follows this list.
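
Here is a minimal sketch of that search, assuming the documents were embedded ahead of time. `embed()` is a placeholder for a real embedding API call, not a function defined here; brute force like this is fine for small libraries, and dedicated vector databases add indexing (e.g., HNSW) to scale it up.

```js
const dot = (a, b) => a.reduce((s, x, i) => s + x * b[i], 0);
const cosine = (a, b) => dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));

// Brute-force nearest-neighbor search over pre-embedded documents
function topK(queryVec, docs, k = 3) {
  return docs
    .map((doc) => ({ ...doc, score: cosine(queryVec, doc.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Usage sketch (embed() stands in for your embedding provider):
// const docs = [{ text: '...', vector: await embed('...') }, /* ... */];
// const hits = topK(await embed('How do I fix a leaky faucet?'), docs);
```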

## Conclusion: The Math of Meaning

Tokenization and embeddings are why AI feels "human." By turning language into a spatial map, we allow machines to understand nuances, synonyms, and relationships that traditional keyword search could never touch.

If you are building an agent, remember: **Better embeddings lead to better retrieval, and better tokenization leads to better efficiency.**
Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
// Parse YAML frontmatter from markdown
function parseFrontmatter(content) {
    const frontmatterRegex = /^---\n([\s\S]*?)\n---/;
    const match = content.match(frontmatterRegex);

    if (!match) {
        return { metadata: {}, content: content };
    }

    const frontmatterStr = match[1];
    const metadata = {};

    // Simple YAML parser
    frontmatterStr.split('\n').forEach(line => {
        const [key, ...valueParts] = line.split(':');
        if (key && valueParts.length > 0) {
            let value = valueParts.join(':').trim();
            value = value.replace(/^["']|["']$/g, '');
            metadata[key.trim()] = value;
        }
    });

    const bodyContent = content.replace(frontmatterRegex, '').trim();
    return { metadata, content: bodyContent };
}

// Get the post filename from URL
function getPostFilename() {
    const path = window.location.pathname;
    const parts = path.split('/');
    // Should be something like /aigc/posts/post-name/
    for (let i = 0; i < parts.length; i++) {
        if (parts[i] === 'posts' && i + 1 < parts.length) {
            return parts[i + 1];
        }
    }
    return null;
}

// Format date
function formatDate(dateStr) {
    try {
        const [year, month, day] = dateStr.split('-');
        const date = new Date(year, month - 1, day);
        // Guard against unparseable input (e.g. 'Unknown'): an invalid Date
        // would otherwise render as the literal string "Invalid Date"
        if (isNaN(date)) {
            return dateStr;
        }
        return date.toLocaleDateString('en-US', { day: 'numeric', month: 'short', year: 'numeric' });
    } catch (e) {
        return dateStr;
    }
}

// Load and render the post
async function loadPost() {
    const filename = getPostFilename();
    if (!filename) {
        document.getElementById('post-container').innerHTML = '<p>Post not found</p>';
        return;
    }

    console.log('Loading post:', filename);

    try {
        // Fetch the markdown file from the posts directory (one level up from current post dir)
        const response = await fetch(`/aigc/posts/${filename}.md`);

        if (!response.ok) {
            console.error('Failed to fetch markdown:', response.status, response.statusText);
            document.getElementById('post-container').innerHTML = `<p>Could not load post (${response.status})</p>`;
            return;
        }

        const markdownContent = await response.text();
        console.log('Markdown loaded, length:', markdownContent.length);

        const { metadata, content } = parseFrontmatter(markdownContent);
        console.log('Metadata:', metadata);

        // Update page title
        if (metadata.title) {
            document.title = `ritchie@singapore~$ ${metadata.title}`;
        }

        // Convert markdown to HTML - ensure marked is available
        let htmlContent;
        console.log('marked available:', typeof marked !== 'undefined');

        if (typeof marked !== 'undefined' && marked.parse) {
            try {
                htmlContent = marked.parse(content);
                console.log('Markdown parsed successfully, HTML length:', htmlContent.length);
            } catch (e) {
                console.error('Error parsing markdown:', e);
                htmlContent = `<pre>${content}</pre>`;
            }
        } else {
            console.warn('marked.js not loaded, showing raw markdown');
            htmlContent = `<pre>${content}</pre>`;
        }

        // Build the post header
        const headerHTML = `
            <div class="post-header">
                <h1 class="post-title">${metadata.title || 'Untitled'}</h1>
                <div class="post-meta">
                    <div class="post-meta-item">
                        <span class="post-meta-label">Published:</span> ${formatDate(metadata.date || 'Unknown')}
                    </div>
                    <div class="post-meta-item">
                        <span class="post-meta-label">Category:</span> <span style="color: #00ff88;">${metadata.category || 'AIGC'}</span>
                    </div>
                </div>
            </div>
        `;

        // Insert the header and content
        const container = document.getElementById('post-container');
        container.innerHTML = headerHTML + htmlContent;

        // Re-render MathJax if it's loaded
        if (typeof MathJax !== 'undefined' && MathJax.typesetPromise) {
            MathJax.typesetPromise([container]).catch(err => console.log('MathJax error:', err));
        }

    } catch (error) {
        console.error('Error loading post:', error);
        document.getElementById('post-container').innerHTML = `<p>Error loading post: ${error.message}</p>`;
    }
}

document.addEventListener('DOMContentLoaded', loadPost);
