Build-time dynamic collections - joins, computed fields, etc - Feedback appreciated #3487
-
Just progressing through the various build steps with this: when you rebuild the collection from a cache, the cache is invalidated because getKeys returns nothing (the afterParse events are not fired again, so the source's build data is empty). The initial build works fine, but on subsequent builds the content that should come from the cache is deleted when the collection is updated. I'm not sure if this is an anti-pattern, but we can persist the collection data to a file and load that file when the collection is re-loaded. Our BuildData becomes:
export class BuildData {
private _collections = new Map<string, BuiltCollection>()
private _hasSeenSomething = false
public get collections(): ReadonlyMap<string, BuiltCollection> {
return this._collections
}
public get hasSeenSomething(): boolean {
return this._hasSeenSomething
}
public seen(hook: FileAfterParseHook): void {
this._hasSeenSomething = true
const { name } = hook.collection
const entry = this.collections.get(name)
?? new BuiltCollectionImpl({}, hook.collection)
entry.data[hook.file.id] = hook
this._collections.set(name, entry)
}
public pick(names: readonly string[]): BuiltCollection[] {
return names.flatMap(n => this.collections.get(n) ?? [])
}
public all(): BuiltCollection[] {
return Array.from(this.collections.values())
}
public get(name: string): BuiltCollection | undefined {
return this.collections.get(name)
}
}
Our dynamic source becomes something like this:
async function prepare({ rootDir }) {
const cacheDir = ufo.joinURL(
rootDir,
'.nuxt',
'content-cache'
)
const cacheFile = ufo.joinURL(
cacheDir,
`${opts.collections.join('_')}.json`
)
if (build.hasSeenSomething) {
processed = await opts.process(build)
await fs.promises.mkdir(cacheDir, { recursive: true })
await fs.promises.writeFile(cacheFile, JSON.stringify(processed, null, 2), 'utf8')
} else {
processed = JSON.parse(
await fs.promises.readFile(cacheFile, 'utf8')
) as ContentFile[]
}
}
async function getKeys() {
return processed.map(f => f.path)
}
async function getItem(key: string): Promise<string> {
const data = processed.find(
f => f.path === key
)?.body ?? ''
return data
}
return defineCollectionSource({
prepare,
getKeys,
getItem
})
}
This lets the content be loaded from cache and keeps the database populated. If any of the underlying content is regenerated (i.e. hasSeenSomething is set to true), the content is re-processed. The full source:
import { defineNuxtModule, useNuxt } from '@nuxt/kit'
import { defineCollectionSource } from '@nuxt/content'
import ufo from 'ufo'
import fs from 'node:fs'
import type {
FileAfterParseHook,
ResolvedCollection,
ContentFile,
ResolvedCustomCollectionSource,
ParsedContentFile
} from '@nuxt/content'
import * as aq from 'arquero'
import type { ColumnTable } from 'arquero'
import { v4 as generate_uuid } from 'uuid'
export interface BuiltCollection {
data: Record<string, FileAfterParseHook>
collection: ResolvedCollection
}
class BuiltCollectionImpl implements BuiltCollection {
constructor(
public data: Record<string, FileAfterParseHook> = {},
public collection?: ResolvedCollection
) {}
}
export class BuildData {
private _collections = new Map<string, BuiltCollection>()
private _hasSeenSomething = false
public get collections(): ReadonlyMap<string, BuiltCollection> {
return this._collections
}
public get hasSeenSomething(): boolean {
return this._hasSeenSomething
}
public seen(hook: FileAfterParseHook): void {
this._hasSeenSomething = true
const { name } = hook.collection
const entry = this.collections.get(name)
?? new BuiltCollectionImpl({}, hook.collection)
entry.data[hook.file.id] = hook
this._collections.set(name, entry)
}
public pick(names: readonly string[]): BuiltCollection[] {
return names.flatMap(n => this.collections.get(n) ?? [])
}
public all(): BuiltCollection[] {
return Array.from(this.collections.values())
}
public get(name: string): BuiltCollection | undefined {
return this.collections.get(name)
}
}
export interface DynamicCollectionOptions {
collections: string[]
process(data: BuildData): Promise<ContentFile[]>
serialize?(file: ContentFile): Promise<string>
}
export interface LeftJoinOptions {
leftCollection: string
leftField: string
collections: BuiltCollection[]
rightField:
| string
| Record<string, string>
outputDir?: string
onMatch?: (match: object) => ContentFile
}
const explodeRows = (rows: object[], key: string): object[] =>
rows.flatMap((r) => {
const v = r[key]
return Array.isArray(v)
? v.map(val => ({
...r,
[key]: val
}))
: [r]
})
function prefixFieldNames(builtCollection: BuiltCollection) {
return (t) => {
const renameMap: Record<string, string> = {}
for (const c of t.columnNames()) {
renameMap[c] = `${builtCollection.collection.name}_${c}`
}
return renameMap
}
}
export function leftJoin(options: LeftJoinOptions): ColumnTable | null {
const leftColl = options.collections.find(c => c.collection?.name === options.leftCollection)
if (!leftColl) return null
const leftRows = Object.values(leftColl.data)
.map(hook => ({
__collection: options.leftCollection,
__id: hook.file.id,
...(hook.content as ParsedContentFile),
...hook.file
}))
let joined = aq
.from(explodeRows(leftRows, options.leftField))
.rename(prefixFieldNames(leftColl))
/* ----------------------------- join rights -------------------------------- */
for (const builtCollection of options.collections) {
if (builtCollection.collection.name === options.leftCollection) continue
// determine the RHS key to use for this collection
const rhsKey
= typeof options.rightField === 'string'
? options.rightField
: options.rightField[builtCollection.collection.name]
if (!rhsKey) continue
const rightRows = Object.values(builtCollection.data)
.map(hook => ({
__collection: builtCollection.collection.name,
__id: hook.file.id,
...(hook.content as ParsedContentFile),
...hook.file
}))
// Create the right table, and prefix everything with the collection name
const right = aq
.from(explodeRows(rightRows, rhsKey))
.rename(prefixFieldNames(builtCollection))
// left‑join on the requested keys
joined = joined.join_left(right, [
`${leftColl.collection.name}_${options.leftField}`,
`${builtCollection.collection.name}_${rhsKey}`
])
}
return joined
}
export interface GenerateVirtualFilesOptions {
outputDir?: string
filename?: (row: Record<string, unknown>) => string
transform?: (row: Record<string, unknown>) => unknown
}
export function generateVirtualFiles(
table: ColumnTable,
options?: GenerateVirtualFilesOptions
): ContentFile[] {
const mergedOptions = {
outputDir: '/virtual',
filename: (row: any) => row.__id ?? generate_uuid(),
transform: (r: any) => r,
...options ?? {}
}
const results = table.objects().map((row) => {
const data = mergedOptions.transform ? mergedOptions.transform(row) : row
const filePath = ufo.joinURL(
'/',
mergedOptions.outputDir,
`${mergedOptions.filename(row)}.json`
)
return <ContentFile>{
path: filePath,
body: JSON.stringify(data, null, 2),
extension: '.json'
}
})
return results
}
export function defineDynamicSource(
opts: DynamicCollectionOptions
): ResolvedCustomCollectionSource {
const nuxt = useNuxt()
const build = new BuildData()
const wanted = new Set(opts.collections)
let processed: ContentFile[] = []
// capture content as it’s parsed
nuxt.hook('content:file:afterParse', (hook) => {
if (wanted.has(hook.collection.name)) {
build.seen(hook)
}
})
async function prepare({ rootDir }) {
const cacheDir = ufo.joinURL(
rootDir,
'.nuxt',
'content-cache'
)
const cacheFile = ufo.joinURL(
cacheDir,
`${opts.collections.join('_')}.json`
)
if (build.hasSeenSomething) {
processed = await opts.process(build)
await fs.promises.mkdir(cacheDir, { recursive: true })
await fs.promises.writeFile(cacheFile, JSON.stringify(processed, null, 2), 'utf8')
} else {
processed = JSON.parse(
await fs.promises.readFile(cacheFile, 'utf8')
) as ContentFile[]
}
}
async function getKeys() {
return processed.map(f => f.path)
}
async function getItem(key: string): Promise<string> {
const data = processed.find(
f => f.path === key
)?.body ?? ''
return data
}
return defineCollectionSource({
prepare,
getKeys,
getItem
})
}
export default defineNuxtModule({
meta: { name: 'nuxt-content-dynamic-collection' },
setup() {}
})
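To make the persist-or-reload idea above easier to follow in isolation, here is a minimal standalone sketch. It is not the module's code: CachedFile, loadOrProcess and processFn are hypothetical names, it uses synchronous fs calls for brevity, and it assumes the cache is a single JSON file.

```typescript
import fs from 'node:fs'
import os from 'node:os'
import path from 'node:path'

// Hypothetical minimal shape; the real ContentFile from @nuxt/content has more fields.
interface CachedFile {
  path: string
  body: string
}

// Re-process and persist when files were parsed in this build,
// otherwise fall back to whatever a previous build wrote to disk.
function loadOrProcess(
  cacheFile: string,
  hasSeenSomething: boolean,
  processFn: () => CachedFile[]
): CachedFile[] {
  if (hasSeenSomething) {
    const processed = processFn()
    fs.mkdirSync(path.dirname(cacheFile), { recursive: true })
    fs.writeFileSync(cacheFile, JSON.stringify(processed, null, 2), 'utf8')
    return processed
  }
  // Nothing was re-parsed. Reuse the previous result, or return nothing if no cache exists yet.
  if (!fs.existsSync(cacheFile)) return []
  return JSON.parse(fs.readFileSync(cacheFile, 'utf8')) as CachedFile[]
}

// Usage: the first call simulates a cold build, the second a rebuild served from cache.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'content-cache-'))
const demoCacheFile = path.join(demoDir, 'nested', 'demo.json')
const cold = loadOrProcess(demoCacheFile, true, () => [{ path: '/virtual/a.json', body: '{}' }])
const warm = loadOrProcess(demoCacheFile, false, () => [])
```

One difference from the prepare function posted above: the sketch returns an empty list when no cache file exists, whereas reading a missing file with fs.promises.readFile would reject.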
-
Trying to remove some jank from this. This is where I am at:
import { defineNuxtModule, useNuxt } from '@nuxt/kit'
import { defineCollectionSource } from '@nuxt/content'
import { hash } from 'ohash'
import ufo from 'ufo'
import fs from 'node:fs'
import type {
FileAfterParseHook,
ResolvedCollection,
ContentFile,
ResolvedCustomCollectionSource,
ParsedContentFile
} from '@nuxt/content'
import * as aq from 'arquero'
import type { ColumnTable, Table } from 'arquero'
import { v4 as generate_uuid } from 'uuid'
import type { SelectHelper } from 'arquero/dist/types/helpers/selection'
import type { RenameMap } from 'arquero/dist/types/table/types'
export interface BuiltCollection {
data: Record<string, FileAfterParseHook>
collection: ResolvedCollection
}
export class BuildData {
private _collections: Map<string, BuiltCollection>
private _fromCache = true
public constructor(collections?: BuiltCollection[]) {
this._collections = new Map<string, BuiltCollection>()
if (collections) {
for (const c of collections) {
this._collections.set(c.collection.name, c)
}
}
}
public get fromCache(): boolean {
return this._fromCache
}
public get collections(): ReadonlyMap<string, BuiltCollection> {
return this._collections
}
public seen(hook: FileAfterParseHook): void {
if (this._fromCache) {
this._fromCache = false
}
const { name } = hook.collection
const entry = this.collections.get(name)
?? { data: {}, collection: hook.collection }
entry.data[hook.file.id] = hook
this._collections.set(
name,
entry
)
}
public pick(names: readonly string[]): BuiltCollection[] {
return names.flatMap(n => this.collections.get(n) ?? [])
}
public all(): BuiltCollection[] {
return Array.from(this.collections.values())
}
public get(name: string): BuiltCollection | undefined {
return this.collections.get(name)
}
public toJson(): string {
const payload = {
collections: Array.from(this._collections.values()).map(c => ({
collection: c.collection,
data: c.data
}))
}
return JSON.stringify(payload)
}
public static fromJson(json: string): BuildData {
const parsed = JSON.parse(json)
const rebuilt: BuiltCollection[] = parsed.collections.map(({ collection, data }) => ({
collection,
data
}))
const bd = new BuildData(rebuilt)
bd._fromCache = true
return bd
}
}
export interface DynamicCollectionOptions {
collections: string[]
cacheKey?: string
process(data: BuildData): Promise<ContentFile[]>
serialize?(file: ContentFile): Promise<string>
}
const explodeRows = (rows: object[], key: string): object[] =>
rows.flatMap((r) => {
const v = r[key]
return Array.isArray(v)
? v.map(val => ({
...r,
[key]: val
}))
: [r]
})
function prefixFieldNames(builtCollection: BuiltCollection) {
return (t: Table): RenameMap => {
const renameMap: Record<string, string> = {}
for (const c of t.columnNames()) {
renameMap[c] = `${builtCollection.collection.name}_${c}`
}
return renameMap
}
}
export interface LeftJoinOptions {
leftCollection: string
leftField: string
collections: BuiltCollection[]
rightField:
| string
| Record<string, string>
outputDir?: string
onMatch?: (match: object) => ContentFile
}
export function leftJoin(options: LeftJoinOptions): ColumnTable | null {
const leftColl = options.collections.find(c => c.collection?.name === options.leftCollection)
if (!leftColl) return null
const leftRows = Object.values(leftColl.data)
.map(hook => ({
__collection: options.leftCollection,
__id: hook.file.id,
...(hook.content as ParsedContentFile),
...hook.file
}))
let joined = aq
.from(explodeRows(leftRows, options.leftField))
.rename(prefixFieldNames(leftColl))
/* ----------------------------- join rights -------------------------------- */
for (const builtCollection of options.collections) {
if (builtCollection.collection.name === options.leftCollection) continue
// determine the RHS key to use for this collection
const rhsKey
= typeof options.rightField === 'string'
? options.rightField
: options.rightField[builtCollection.collection.name]
if (!rhsKey) continue
const rightRows = Object.values(builtCollection.data)
.map(hook => ({
__collection: builtCollection.collection.name,
__id: hook.file.id,
...(hook.content as ParsedContentFile),
...hook.file
}))
// Create the right table, and prefix everything with the collection name
const right = aq
.from(explodeRows(rightRows, rhsKey))
.rename(prefixFieldNames(builtCollection))
// left‑join on the requested keys
joined = joined.join_left(right, [
`${leftColl.collection.name}_${options.leftField}`,
`${builtCollection.collection.name}_${rhsKey}`
])
}
return joined
}
export interface InnerJoinOptions {
leftCollection: string
leftField: string
collections: BuiltCollection[]
rightField:
| string
| Record<string, string>
outputDir?: string
onMatch?: (match: object) => ContentFile
}
export function innerJoin(options: InnerJoinOptions): ColumnTable | null {
const leftColl = options.collections.find(
c => c.collection?.name === options.leftCollection
)
if (!leftColl) return null
// Build the left (base) table
const leftRows = Object.values(leftColl.data).map(hook => ({
__collection: options.leftCollection,
__id: hook.file.id,
...(hook.content as ParsedContentFile),
...hook.file
}))
let joined = aq
.from(explodeRows(leftRows, options.leftField))
.rename(prefixFieldNames(leftColl))
/* ----------------------------- join rights -------------------------------- */
for (const builtCollection of options.collections) {
if (builtCollection.collection.name === options.leftCollection) continue
// pick the RHS key for this collection
const rhsKey
= typeof options.rightField === 'string'
? options.rightField
: options.rightField[builtCollection.collection.name]
if (!rhsKey) continue
// build RHS table, prefixing its columns
const rightRows = Object.values(builtCollection.data).map(hook => ({
__collection: builtCollection.collection.name,
__id: hook.file.id,
...(hook.content as ParsedContentFile),
...hook.file
}))
const right = aq
.from(explodeRows(rightRows, rhsKey))
.rename(prefixFieldNames(builtCollection))
// inner-join on the requested keys
joined = joined.join(right, [
`${leftColl.collection.name}_${options.leftField}`,
`${builtCollection.collection.name}_${rhsKey}`
])
}
return joined
}
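function hashObject(obj: object): string {
// ohash serializes objects deterministically before hashing, so key order does not matter.
// An array replacer passed to JSON.stringify is applied at every nesting level and would
// drop nested keys that are missing from the top-level key list.
return hash(obj)
}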
export interface GenerateVirtualFilesOptions {
outputDir?: string
filename?: (row: object) => string
transform?: (row: object) => object
}
export function generateVirtualFiles(
table: ColumnTable,
options?: GenerateVirtualFilesOptions
): ContentFile[] {
const mergedOptions = {
outputDir: '/virtual',
filename: (row: object): string => `${hashObject(row)}.json`,
transform: (row: object): object => row,
...options ?? {}
}
const results = table
.objects()
.map((row: object) => {
const data = mergedOptions.transform
? mergedOptions.transform(row)
: row
const filePath = ufo.joinURL(
'/',
mergedOptions.outputDir,
`${mergedOptions.filename(row)}`
)
return <ContentFile>{
path: filePath,
body: JSON.stringify(data, null, 2),
extension: '.json'
}
})
return results
}
declare module '@nuxt/schema' {
interface NuxtHooks {
'dynamic-collections:buildData:updated': (ctx: BuildData) => void
}
}
export function defineDynamicSource(
opts: DynamicCollectionOptions
): ResolvedCustomCollectionSource {
const nuxt = useNuxt()
const processed: ContentFile[] = []
let buildData: BuildData = new BuildData()
nuxt.hook(
'dynamic-collections:buildData:updated',
data => buildData = data
)
async function prepare() {
processed.push(
...(
await opts.process(
new BuildData(
buildData.pick(opts.collections)
)
)
)
)
}
async function getKeys() {
return processed.map(f => f.path)
}
async function getItem(key: string): Promise<string> {
return processed.find(
f => f.path === key
)?.body ?? ''
}
return defineCollectionSource({
prepare,
getKeys,
getItem
})
}
export default defineNuxtModule({
meta: { name: 'nuxt-content-dynamic-collection' },
async setup(_, nuxt) {
let build = new BuildData()
const rootDir = nuxt.options.rootDir
const cacheDir = ufo.joinURL(rootDir, '.nuxt', 'content-cache')
const cacheFile = ufo.joinURL(cacheDir, 'content.json')
nuxt.hook('builder:generateApp', async () => {
if (fs.existsSync(cacheFile) && build.fromCache) {
const json = await fs.promises.readFile(
cacheFile,
'utf8'
)
build = BuildData.fromJson(json)
}
await nuxt.callHook('dynamic-collections:buildData:updated', build)
})
nuxt.hook('content:file:afterParse', async (ctx) => {
build.seen(ctx)
await nuxt.callHook('dynamic-collections:buildData:updated', build)
})
nuxt.hook('modules:done', async () => {
if (build.fromCache) return
const rootDir = nuxt.options.rootDir
const cacheDir = ufo.joinURL(
rootDir,
'.nuxt',
'content-cache'
)
if (!fs.existsSync(cacheDir)) {
await fs.promises.mkdir(cacheDir, {
recursive: true
})
}
const cacheFile = ufo.joinURL(
cacheDir,
'content.json'
)
// Or write all the data
await fs.promises.writeFile(
cacheFile,
build.toJson(),
'utf8'
)
})
}
})
In my browse_items:
defineCollection({
schema: createJoinCollectionSchema(),
type: 'page',
source: defineDynamicSource({
collections: [
'browse',
'members',
'blog',
'podcast'
],
async process(data: BuildData): Promise<ContentFile[]> {
const memberDataRaw = innerJoin({
collections: data.pick([
'browse',
'members'
]),
leftCollection: 'browse',
leftField: 'token',
rightField: 'categories'
})
if (!memberDataRaw) {
return []
}
const memberDataRenamed = memberDataRaw
.select({
members_id: 'id',
members___collection: 'collection',
members_token: 'token',
members_stem: 'stem',
members_categories: 'categories',
members_navigation: 'navigation',
members_title: 'title',
members_description: 'description',
members_page_type: 'page_type',
members_presenters: 'presenters',
members_tags: 'tags',
members_body: 'body'
})
const virtualFiles = generateVirtualFiles(
memberDataRenamed
)
return virtualFiles
}
})
})
The 'body' field is currently incorrect: it contains the entire JSON object that was computed. It might be better to switch to MDC with frontmatter. For the most part, though, this builds what I want.
A nice thing to fall out of this is that downstream dynamic sources can use the output of a previous one, so you can build up a series of joins at build time if you need such a thing. You can also utilize external sources, such as Bitbucket or GitHub, and then pivot your local data or an arbitrary API source.
If you need to define the virtual filename you can; just keep in mind that it needs to be deterministic in order to avoid invalidating the cache. By default I am using the same hash method as Nuxt Content to generate a virtual filename for each row.
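As an illustration of what "deterministic" means for the filename, here is a small sketch that derives a name from the row contents, independent of key order. It swaps in node:crypto with a hand-rolled stable serializer instead of the ohash call used in the module; stableStringify and virtualFilename are hypothetical helper names.

```typescript
import { createHash } from 'node:crypto'

// Serialize with object keys sorted at every level, so key order never changes the output.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(',')}]`
  }
  if (value !== null && typeof value === 'object') {
    const entries = Object.entries(value as Record<string, unknown>)
      .filter(([, v]) => v !== undefined)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([k, v]) => `${JSON.stringify(k)}:${stableStringify(v)}`)
    return `{${entries.join(',')}}`
  }
  // Primitives; values JSON cannot represent (e.g. undefined) become null.
  return JSON.stringify(value) ?? 'null'
}

// The same row contents always give the same filename, regardless of key order.
function virtualFilename(row: Record<string, unknown>): string {
  const digest = createHash('sha256').update(stableStringify(row)).digest('hex')
  return `${digest.slice(0, 16)}.json`
}

const nameA = virtualFilename({ id: 1, meta: { title: 'x', tags: ['a', 'b'] } })
const nameB = virtualFilename({ meta: { tags: ['a', 'b'], title: 'x' }, id: 1 })
const nameC = virtualFilename({ id: 2, meta: { title: 'x', tags: ['a', 'b'] } })
```

Here nameA and nameB are identical while nameC differs. A random value such as a UUID would give a new path on every build, which is what invalidates the cache.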
-
Hey all,
I tried asking on the discord if there is another package to do this, and didn't receive feedback, so I figured I'd publish what I have for this. I do not have time currently to maintain this as a module, but would appreciate any feedback (or other solutions) if you have time. This is a long write-up about the problem and the solution I came up with. This is still a WIP. Sorry for the TLDR, code at the bottom.
Overview of the problem
On our website we have a few collections that really need to be computed. We run an e-learning website, and have multiple collections that cover a bunch of different types of content. For example, we have a blog, podcast, quizzes, videos, playlists, etc.
We have two collections we'd like to be essentially tag-driven: a main browse collection, which is a list of links to various content, and a presenters collection, which, again, is a list of links and metadata for each piece of content a presenter was involved in.
So on the presenters page, we might want to show a list of their blog posts, podcast episodes they've appeared in, educational courses they've produced, etc.
This leads to a situation where we needed to query essentially every collection to show a presenter, which meant downloading all of our SQL dumps.
What I really needed was a way to generate a computed collection of navigation items for presenters, which just pulls the meta-data and navigational links to the related content into another collection we can query specifically to find related items.
A more conventional use-case might be categories. Let's say you have a collection of categories, and a ton of different content on your website, which has the following frontmatter:
On the category, you might have something like this
And, on the blog posts, which is in a different collection, you might want to say:
When you visit mysuperawesomecategory, you'd want to be able to show all blogs which appear under that category without pulling the entirety of the blog dump.
Instead, if we have a third collection, category_items, which is a join between matching categories, we can just grab some metadata about our navigation to properly display the page, without downloading all of our content.
We may also want some build-time validation such that if someothercategory doesn't exist, our build process has the opportunity to complain about it.
Finally, if we solve this particular issue, we can do quite a lot with things like aggregates and more dynamic content via collections.
I want this to be built into the broader nuxt-content ecosystem, so we need a way to build a virtual collection which gets handled by the other parts of nuxt-content through the tools we have. I do not want to, for example, directly modify the SQL dump or pre-render JSON via Nitro. I just want a dynamic collection.
Proposed solution
We can use the CustomSource to inject content into the nuxt-content build process. To do this we need both collections, in their entirety, in memory. Then we can do our join, or other dynamic logic, and return virtual files.
Nuxt Content offers a few hooks we can tie into. However, these operate on a per-file basis.
Since we need the full content, we can create an observer that records each file it sees as it is parsed, provided the file belongs to a collection we care about.
You can do that like:
Where BuildData is an object which keeps track of our collections and files (see the full source).
Nuxt Content will load the collections in order, along with their respective files, which means you can do something like this:
This will listen to all the collections in the collections array and hold them in memory as they load, then call a process function with all the built data that source cares about, which expects a collection of ContentFile. I am using ContentFile here just because it has everything we more-or-less want to do this. In our custom source, we can just call a single function to generate and process the files, and then handle the lookup with getKeys and getItem. It will serialize our object to JSON and return a JSON file from the source.
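The flow described above (collect files per collection, process them once, then serve lookups) can be sketched without Nuxt at all. Everything here is a simplified stand-in: ParsedEvent, VirtualFile, Collector and createSource are hypothetical names, and the methods are synchronous, whereas the real custom source functions are async.

```typescript
// Hypothetical, simplified stand-ins for the types exported by @nuxt/content.
interface ParsedEvent {
  collection: { name: string }
  file: { id: string }
  content: Record<string, unknown>
}
interface VirtualFile {
  path: string
  body: string
}

// Observer: remembers every parsed file that belongs to a wanted collection.
class Collector {
  private wanted: ReadonlySet<string>
  private byCollection = new Map<string, Map<string, ParsedEvent>>()

  constructor(wanted: Iterable<string>) {
    this.wanted = new Set(wanted)
  }

  // Intended to be called once per file, e.g. from a 'content:file:afterParse' hook.
  seen(event: ParsedEvent): void {
    const name = event.collection.name
    if (!this.wanted.has(name)) return
    const files = this.byCollection.get(name) ?? new Map<string, ParsedEvent>()
    files.set(event.file.id, event)
    this.byCollection.set(name, files)
  }

  files(name: string): ParsedEvent[] {
    return Array.from(this.byCollection.get(name)?.values() ?? [])
  }
}

// Source: runs the processing step once, then answers key and item lookups from memory.
function createSource(collector: Collector, processFn: (c: Collector) => VirtualFile[]) {
  let processed: VirtualFile[] = []
  return {
    prepare(): void { processed = processFn(collector) },
    getKeys(): string[] { return processed.map(f => f.path) },
    getItem(key: string): string { return processed.find(f => f.path === key)?.body ?? '' }
  }
}

// Usage: only the 'blog' file is kept, the 'podcast' file is ignored.
const collector = new Collector(['blog'])
collector.seen({ collection: { name: 'blog' }, file: { id: 'hello' }, content: { title: 'Hello' } })
collector.seen({ collection: { name: 'podcast' }, file: { id: 'ep1' }, content: { title: 'Ep 1' } })

const source = createSource(collector, c =>
  c.files('blog').map(e => ({ path: `/virtual/${e.file.id}.json`, body: JSON.stringify(e.content) }))
)
source.prepare()
```

After prepare(), getKeys() lists the virtual paths and getItem() returns the serialized body for a path, or an empty string for an unknown key.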
This obviously does not take into account any memory considerations and isn't super efficient, so feedback would be appreciated.
Source Code
This uses arquero (which is similar to Pandas, if you're familiar), which lets us handle the data as a table in memory, so pnpm install arquero
Throw this in a folder ~~/modules/nuxt-content-dynamic-collection, named index.ts
In your nuxt.config.ts, our module comes after @nuxt/content:
In your content.config.ts, define a dynamic source:
Notes
The join logic is super questionable and a WIP; I am including it just as an example. It currently prefixes each table's fields with its collection name. E.g., if you use the leftJoin function as in that example, the resulting fields will be browse_${field}..., pages_${field}..., etc. If you use that code, ensure you either use .rename to rename the fields, or take the prefix into account in your schema.
Sorry for the TLDR, I am just hoping someone out there, if they are facing the same problem, is willing to take this and make a better module, or help improve this in some way.
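To make the prefixing note above concrete, here is a tiny sketch in plain TypeScript, without arquero. prefixColumns and leftJoinRows are hypothetical helpers and the rows are made-up sample data. Unlike a table library, this naive join simply leaves the right-hand columns out of unmatched rows.

```typescript
type Row = Record<string, unknown>

// Prefix every column with the collection name, e.g. title -> browse_title.
function prefixColumns(collection: string, row: Row): Row {
  return Object.fromEntries(
    Object.entries(row).map(([key, value]) => [`${collection}_${key}`, value])
  )
}

// Naive left join: every left row is kept; unmatched rows carry no right-hand columns.
function leftJoinRows(left: Row[], right: Row[], leftKey: string, rightKey: string): Row[] {
  return left.flatMap((l) => {
    const matches = right.filter(r => r[rightKey] === l[leftKey])
    return matches.length > 0 ? matches.map(r => ({ ...l, ...r })) : [l]
  })
}

// Made-up sample rows standing in for two parsed collections.
const browseRows = [{ token: 'math', title: 'Math' }, { token: 'art', title: 'Art' }]
  .map(r => prefixColumns('browse', r))
const memberRows = [{ categories: 'math', title: 'Algebra 101' }]
  .map(r => prefixColumns('members', r))

const joinedRows = leftJoinRows(browseRows, memberRows, 'browse_token', 'members_categories')

// Map the prefixed names back to plain ones before handing rows to a schema.
const renamedRows = joinedRows.map(r => ({
  category: r['browse_title'],
  title: r['members_title'] ?? null
}))
```

Because both collections have a title field, the prefix is what keeps browse_title and members_title from overwriting each other in the joined row.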
Thanks for coming to my TED talk
Edit: Slightly updated version along with an example of a join