ai-notes/TEXT.md at main · old247/ai-notes

Table of Contents

Language Models
Applications
Top GPT3 Prompt Engineering Reads
How GPT works
Don't call it generative
Specialized language models
GPT Products
GPT tooling
Ethical issues
Flan-T5
Misc Text AI

My best timeline of GPT efforts is listed here: https://lspace.swyx.io/p/open-source-ai

Datasets

WebText and OpenWebText
- GPT-2’s training data (WebText) is based on Reddit. According to Pew Internet Research’s 2016 survey, 67% of Reddit users in the US are men and 64% are between ages 18 and 29. https://stanford-cs324.github.io/winter2022/lectures/data/
Common Crawl: https://commoncrawl.org/
- We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone.
- In January 2015 it was over 139TB in size and contained 1.82 billion webpages. https://commoncrawl.github.io/cc-crawl-statistics/plots/crawlsize
- growing steadily at 200-300 TB per month for the last few years.
- March 2021: 6.4 PB
- https://commoncrawl.github.io/cc-crawl-statistics/
- OSCAR corpus from INRIA. OSCAR is a huge multilingual corpus obtained by language classification and filtering of Common Crawl dumps of the Web.
C4: https://paperswithcode.com/dataset/c4
- C4 is a colossal, cleaned version of Common Crawl's web crawl corpus. It was based on Common Crawl dataset: https://commoncrawl.org. It was used to train the T5 text-to-text Transformer models.
- It comes in four variants:
  - en: 305GB in JSON format
  - en.noblocklist: 380GB in JSON format
  - en.noclean: 2.3TB in JSON format
  - realnewslike: 15GB in JSON format
The Pile: https://arxiv.org/abs/2101.00027
- an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets -- both existing and newly constructed -- many of which derive from academic or professional sources
- behind the scenes of the collection https://news.ycombinator.com/item?id=34359453

GPT3 data https://stanford-cs324.github.io/winter2022/lectures/data/#gpt-3-dataset

Language Models

GPT3 advanced a lot through 2020-2022 https://twitter.com/tszzl/status/1572350675014516738
Eleuther's GPT-J-6B, GPT-NeoX
Google
- PaLM 540B
  - https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/
  - FLAN-T5
    - Flan-T5 checkpoints are publicly available (without requesting access)
    - Flan-T5 11B outperforms OPT-IML on MMLU and Big-Bench Hard, despite being 10x more compute efficient
    - https://huggingface.co/docs/transformers/model_doc/flan-t5
Yandex YaLM 100B https://medium.com/yandex/yandex-publishes-yalm-100b-its-the-largest-gpt-like-neural-network-in-open-source-d1df53d0e9a6
- It took us 65 days to train the model on a pool of 800 A100 graphics cards and 1.7 TB of online texts, books, and countless other sources.
Tsinghua GLM-130B
- outperforms OpenAI's GPT-3 175B and Google's PaLM 540B on critical benchmarks. It is also open sourced, which means you can run this model on your own machine, for free.
- only trained on 400B tokens (compared to 1.4T tokens for Chinchilla's 70B parameters)
- https://twitter.com/AndyChenML/status/1611529311390949376?s=20
Meta
- OPT-175B https://opt.alpa.ai/ (bad reviews)
- OPT-IML (Instruction Meta-Learning): instruction tuned https://github.com/facebookresearch/metaseq/tree/main/projects/OPT-IML
  - a new language model from Meta AI with 175B parameters, fine-tuned on 2,000 language tasks — openly available soon under a noncommercial license for research use cases.
  - instruction-finetuned, leverages Chinchilla scaling laws, and has bells and whistles like 4-bit quantization and bidirectional attention. With 4-bit quantization, the model can run on 1 x 80 GB A100 or a consumer GPU rig.
  - https://twitter.com/MetaAI/status/1605991218953191424
  - underperforms Flan-T5 https://twitter.com/_jasonwei/status/1621333297891790848?s=20
FlashAttention - 3-5x faster training (tweet, huggingface)
GPT-JT for classification
GPT 3.5 (https://beta.openai.com/docs/model-index-for-researchers)
- code-davinci-002 is a base model, so good for pure code-completion tasks
- text-davinci-002 is an InstructGPT model based on code-davinci-002
- text-davinci-003 is an improvement on text-davinci-002
  - https://scale.com/blog/gpt-3-davinci-003-comparison
    - 003 is 30% better at classifying, can rhyme, output iambic pentameter, is more verbose (42 words per sentence vs 23).
  - https://twitter.com/amix3k/status/1597504050852859904
  - https://twitter.com/_brivael_/status/1597625080418533377
- InstructGPT https://openai.com/blog/instruction-following/
- ChatGPT: https://openai.com/blog/chatgpt/
  - We’ve trained a model called ChatGPT which interacts in a conversational way. The dialogue format makes it possible for ChatGPT to answer followup questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests.

GPT4

gpt4 speculations and improvement directions https://mobile.twitter.com/mayfer/status/1607816595065409536
https://twitter.com/RamaswmySridhar/status/1605603050403483652?s=20&t=0zl_ZGLHLxjgJ-FLk-m-Fg
- Biggest model size for GPT-4 will be 1T parameters. Up 6x. Not 100T - The reason is simple: instruction fine tuning achieves same quality with 100x smaller models.
- GPT-4 will use 10T tokens. Up 33x, and putting them on the Chinchilla scaling curve.
- We expect a context length of 16384 tokens
- Biggest pre-training modeling change? A loss function that looks like UL2
- Put together, at least 800x more compute for the pre-trained model.
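
A rough sanity check of the ratios quoted above. This is only a sketch: it assumes the common rule of thumb that training compute is about 6 × parameters × tokens, and it takes GPT-3 (175B parameters, 300B training tokens, both figures from these notes) as the baseline. The function name `train_flops` is hypothetical.

```python
def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute using the C ~= 6 * N * D rule of thumb."""
    return 6.0 * params * tokens


# Baseline and speculated figures, taken from the notes above.
gpt3 = {"params": 175e9, "tokens": 300e9}
speculated = {"params": 1e12, "tokens": 10e12}

param_ratio = speculated["params"] / gpt3["params"]
token_ratio = speculated["tokens"] / gpt3["tokens"]
compute_ratio = train_flops(**speculated) / train_flops(**gpt3)

print(f"parameters: {param_ratio:.1f}x")  # about 5.7x, quoted as "Up 6x"
print(f"tokens:     {token_ratio:.1f}x")  # about 33.3x, quoted as "Up 33x"
print(f"compute:    {compute_ratio:.0f}x")  # about 190x under this approximation
```

Under this approximation the parameter and token increases alone account for roughly 190x, so the quoted "at least 800x" presumably includes other factors that these notes do not break down.
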

Applications

GPT3 applications:

text to graphviz https://twitter.com/goodside/status/1561549768987496449?s=21&t=rliacnWOIjJMiS37s8qCCw
suspending to python for math
Amelia paragraph summarizer https://twitter.com/Wattenberger/status/1412480516268437512
Karina Nguyen Synevyr https://twitter.com/karinanguyen_/status/1566884536054677506
Lex.page
https://github.com/louis030195/obsidian-ava obsidian integration
https://humanloop.com/ Playground that brings variable interpolation to prompts and lets you turn them into API endpoints. Once you're deployed, it also lets you collect past generations along with user behavioral feedback for fine-tunes.
https://www.everyprompt.com/ extends Playground in a similar way: Putting variables in prompts and giving you a single button to go from prompt to API. Has nice developer-oriented touches in the UI too — e.g. displaying invisible chars as ghosts.
explaining code diffs https://app.whatthediff.ai/dashboard
LangChain https://twitter.com/hwchase17/status/1588183981312180225
- implements and lets you easily compose many published LLM prompting techniques. Implements self-asking, web search, REPL math, and several of my own prompts.
- All relevant chains now have a "verbose" option to highlight text according to the model or component (SQL DB, search engine, python REPL, etc) that it's from.
https://dust.tt/ gives a collapsible tree UI for representing k-shot example datasets, prompt templates, and prompt chaining with intermediate JS code. Replaces a lot of code around prompt APIs.
playing chess??? https://twitter.com/Raza_Habib496/status/1591514638520311809
chrome extension
simulating people
- https://jack-clark.net/2022/10/11/import-ai-305-gpt3-can-simulate-real-people-ai-discovers-better-matrix-multiplication-microsoft-worries-about-next-gen-deepfakes/
Making stories with characters https://medium.com/@turc.raluca/introducing-rick-and-mortify-a14e56a8cb67

wiring up LLMs to python https://twitter.com/karpathy/status/1593081701454204930?s=20&t=2ra2Yfz0NFSbfJ_IGixNjA
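
A minimal sketch of the "suspend to Python for math" idea from the list above: the model is asked to emit an arithmetic expression instead of a final number, and Python does the calculation. `fake_llm` is a hypothetical stand-in for a real model call, and the evaluator only accepts plain arithmetic so that model output is never passed to `eval()`.

```python
import ast
import operator

# Whitelist of arithmetic operators the evaluator is willing to run.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}


def eval_arithmetic(expr: str):
    """Evaluate a plain arithmetic expression by walking its AST."""

    def _eval(node):
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression: {expr!r}")

    return _eval(ast.parse(expr, mode="eval").body)


def fake_llm(question: str) -> str:
    """Hypothetical model call: returns an expression rather than an answer."""
    return "(1234 * 5678) + 9"


def answer(question: str):
    """Ask the (stubbed) model for an expression, then compute it in Python."""
    return eval_arithmetic(fake_llm(question))


print(answer("What is 1234 times 5678, plus 9?"))
```

Anything other than numbers and the listed operators (names, calls, attribute access) raises `ValueError`, which is the point of parsing the expression rather than executing it.
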

How GPT works

original paper Improving Language Understanding by Generative Pre-Training Radford et al 2018 https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
https://github.com/karpathy/minGPT
- announcement https://twitter.com/karpathy/status/1295410274095095810
- used in https://www.mosaicml.com/blog/gpt-3-quality-for-500k
- check out nanoGPT too
https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tracing-Emergent-Abilities-of-Language-Models-to-their-Sources-b9a57ac0fcf74f30a1ab9e3e36fa1dc1
- There are three important abilities that the initial GPT-3 exhibits: language generation, in-context learning, and world knowledge
- to pretrain the 175B parameters model on 300B tokens (60% 2016 - 2019 C4 https://www.tensorflow.org/datasets/catalog/c4 + 22% WebText2 + 16% Books + 3% Wikipedia).
- Language generation comes from training objective
- World knowledge comes from the 300B token corpus and is stored in the 175B model
- The ability of complex reasoning with chain-of-thought is likely to be a magical side product of training on code
- We have concluded:
  - The language generation ability + basic world knowledge + in-context learning are from pretraining (davinci)
  - The ability to store a large amount of knowledge is from the 175B scale.
  - The ability to follow instructions and generalize to new tasks is from scaling instruction tuning (davinci-instruct-beta)
  - The ability to perform complex reasoning is likely to be from training on code (code-davinci-002)
  - The ability to generate neutral, objective, safe, and informative answers is from alignment with humans. Specifically:
    - If supervised tuning, the resulting model is text-davinci-002
    - If RLHF, the resulting model is text-davinci-003
    - Either supervised or RLHF, the models cannot outperform code-davinci-002 on many tasks, which is called the alignment tax.
  - The dialog ability is also from RLHF (ChatGPT). Specifically, it trades off in-context learning for:
    - Modeling dialog history
    - Increased informativeness
    - Rejecting questions outside the model’s knowledge scope

Don't call it generative

Reasoning: https://twitter.com/alexandr_wang/status/1588933553290870785
Understanding: https://twitter.com/EMostaque/status/1585903983180537856

Specialized language models

Scientific language models like Meta's Galactica exist. Commentary https://news.ycombinator.com/item?id=33614608

GPT Products

directory
- https://gpt3demo.com/
Jasper
CopyAI
Features of existing products
- NotionAI
- https://hashnode.com/neptune
Email
- Ellie email https://twitter.com/JamesIvings/status/1602855048148500480
- Everyprompt mail
- https://merlin.foyer.work/
Summarizers
- explainpaper
- kagi universal summarizer https://labs.kagi.com/ai/sum?url=airbyte.io
SQL
- perplexity.ai/sql
  - https://twitter.com/perplexity_ai/status/1605250295780773889
Newer
- https://www.protocol.com/generative-ai-startup-landscape-map
- https://metaphor.systems/
- dust.tt
- tools that make tools (toolbot.ai)
- https://lex.page (announcement)
- CLI https://twitter.com/KevinAFischer/status/1601883697061380096?s=20
- Zapier OpenAI integrations https://zapier.com/apps/openai/integrations
- SlackGPT https://zapier.com/shared/query-gpt-3-via-a-slack-channel/a7551c53beda75b3bf65c315f027de04a4b323ef
- GPT3Discord moderator https://github.com/Kav-K/GPT3Discord
- extract gpt chrome extension https://twitter.com/kasrak/status/1624515411973922816?s=20
- Embra macos desktop chatgptlike
  - https://twitter.com/zachtratar/status/1623015294569713665?s=20&t=hw_somO_R_JxGp4zQpFz0Q
  - competes with Dust XP1
Writing
- Verb (fiction) https://twitter.com/verbforwriters/status/1603051444134895616
- Orchard https://www.orchard.ink/doc/201a7f63-731e-4487-926a-fdf348f1b00c
  - https://twitter.com/alexjkwang/status/1603408050005557249?s=20
- Deepmind Dramatron https://deepmind.github.io/dramatron/details.html for co-writing theatre scripts and screenplays. Starting from a log line, Dramatron interactively generates character descriptions, plot points, location descriptions and dialogue. These generations provide human authors with material for compilation, editing, and rewriting.
- BearlyAI https://twitter.com/TrungTPhan/status/1597623720239329280
google sheets https://twitter.com/mehran__jalali/status/1608159307513618433

GPT tooling

mostly from https://twitter.com/goodside/status/1588247865503010816

Humanloop.com Playground - variable interpolations + api endpoints, collect generations with feedback
Everyprompt.com Playground - similar to above with ux improvements
Introducing http://Promptable.ai The Ultimate Workspace for Prompt Engineering
- 1. A Delightful Prompt Editor: organize prompts in folders, stream completions, add variables, change model parameters (even custom models!)
- 2. Batch Testing! We built a table that you can run completions on to evaluate your prompts. How? Add multiple inputs that your prompt needs to handle, click run and process them in batch, and add annotation columns (MultiSelect, Markdown) to keep track of the status of inputs.
- 3. Version and Deploy. You're happy with your completions and you added some test cases. It's time to ship! Deploy your prompt and fetch the latest version from our API. Simple storage. No added latency between you and your LLM.
Langchain python package - implements many techniques
Lambdaprompt https://github.com/approximatelabs/lambdaprompt
- used in pandas extension https://github.com/approximatelabs/sketch
https://gpt-index.readthedocs.io/en/latest/ GPT Index is a project consisting of a set of data structures designed to make it easier to use large external knowledge bases with LLMs.
Dust.tt - tree UI for k-shot datasets, prompt templates, prompt chaining
Spellbook from ScaleAI - automatically write k-shots, eval metrics for prompt variants, prompts to spreadsheet functions
Linus/thesephist tools
- tree of branches https://twitter.com/thesephist/status/1590545448066252800
- scrubbing a text for length https://twitter.com/thesephist/status/1587929014848540673
- Most knowledge work isn't a text-generation task, and your product shouldn't ship an implementation detail of LLMs as the end-user interface https://twitter.com/thesephist/status/1592924891208372224
mozilla's readability-cli https://www.npmjs.com/package/readability-cli (tip)
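
Several of the tools above revolve around prompt templates with variable interpolation and k-shot examples. A minimal sketch of that idea using only the Python standard library follows. It is not the API of any tool listed here, and `build_k_shot_prompt` and its parameters are hypothetical names.

```python
from string import Template
from typing import List, Tuple


def build_k_shot_prompt(
    instruction: str,
    examples: List[Tuple[str, str]],
    template: str,
    **variables: str,
) -> str:
    """Assemble an instruction, k worked examples, and an interpolated query."""
    shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    # substitute() raises KeyError if a $variable in the template is missing.
    query = Template(template).substitute(**variables)
    return f"{instruction}\n\n{shots}\n\n{query}"


prompt = build_k_shot_prompt(
    "Classify the sentiment of each input as positive or negative.",
    [("great movie", "positive"), ("boring plot", "negative")],
    "Input: $text\nOutput:",
    text="loved it",
)
print(prompt)
```

The resulting string ends with an open `Output:` slot for the model to complete, and swapping the `examples` list is all it takes to compare k-shot variants of the same template.
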

dealing with GPT context size

There is a paper by OpenAI on summarizing long documents. Essentially, break a longer text into smaller chunks and run a multi-stage sequential summarization: each chunk uses a trailing window of the previous chunk as context, and the process is applied recursively to the resulting summaries. https://arxiv.org/abs/2109.10862. more: https://news.ycombinator.com/item?id=34423822
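
The chunk-and-recurse procedure described above can be sketched as follows. This follows the description in these notes rather than the paper's exact method. The `summarize` callable is a hypothetical stand-in for a model call, and the chunk sizes are counted in words rather than tokens for simplicity.

```python
from typing import Callable, List


def chunk_words(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def summarize_long(
    text: str,
    summarize: Callable[[str, str], str],  # (chunk, context) -> summary
    chunk_size: int = 200,
    window: int = 50,
    max_words: int = 200,
) -> str:
    """Summarize chunk by chunk, then recurse on the joined summaries."""
    words = text.split()
    if len(words) <= max_words:
        return summarize(text, "")

    summaries = []
    previous = ""
    for chunk in chunk_words(text, chunk_size):
        # Trailing window of the previous chunk is passed along as context.
        context = " ".join(previous.split()[-window:])
        summaries.append(summarize(chunk, context))
        previous = chunk

    combined = " ".join(summaries)
    if len(combined.split()) >= len(words):
        # Guard: stop recursing if the summaries did not shrink the text.
        return summarize(combined, "")
    return summarize_long(combined, summarize, chunk_size, window, max_words)


def toy_summarize(chunk: str, context: str) -> str:
    """Hypothetical stub: keeps the first 10 words instead of calling a model."""
    return " ".join(chunk.split()[:10])


long_text = " ".join(f"w{i}" for i in range(1000))
result = summarize_long(long_text, toy_summarize)
print(result)
```

With a real model, `summarize` would build a prompt from the context and the chunk, and `chunk_size` would be chosen so that both fit in the model's context window.
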
https://github.com/jerryjliu/gpt_index
- Current state: LLMs have made phenomenal progress in encoding knowledge as well as reasoning. BUT a big limitation of LLMs is context size (4096 tokens in Davinci), and if you want to feed an LLM custom knowledge it will either need to fit in the prompt or be finetuned (expensive)!
- https://twitter.com/mathemagic1an/status/1609225733934616577?s=46&t=DgrykKeTlGWgdxRkv2_tKw
godly ai https://godly.ai
HyDE: Hypothetical Document Embeddings
- https://twitter.com/mathemagic1an/status/1615378778863157248
- Take your query => create hypothetical answer => embed hypothetical answer => use this to search through doc embeddings
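
The HyDE steps above can be sketched as follows. This is a toy illustration only: `embed` is a bag-of-words counter standing in for a learned embedding model, and `fake_generate` is a hypothetical stub for the LLM call that writes the hypothetical answer.

```python
import math
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a real system would use an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hyde_search(
    query: str,
    docs: List[str],
    generate_answer: Callable[[str], str],
    top_k: int = 1,
) -> List[str]:
    """Query -> hypothetical answer -> embed it -> rank documents against it."""
    hypothetical = generate_answer(query)
    q_vec = embed(hypothetical)
    ranked = sorted(docs, key=lambda d: cosine(q_vec, embed(d)), reverse=True)
    return ranked[:top_k]


def fake_generate(query: str) -> str:
    """Hypothetical stub for an LLM that drafts a plausible answer."""
    return "the french president lives in paris the capital of france"


docs = [
    "the mitochondria produces energy for the cell",
    "paris is the capital city of france",
    "python is a programming language",
]
hits = hyde_search("where do french leaders live", docs, fake_generate)
print(hits)
```

Note that the raw query shares no words with the matching document here. The search works because the hypothetical answer, not the query, is what gets embedded.
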
Structured Prompting: Scaling In-Context Learning to 1,000 Examples

Ethical issues

tokens can be up to ~16x (15.77x) more expensive for certain languages https://denyslinkov.medium.com/why-is-gpt-3-15-77x-more-expensive-for-certain-languages-2b19a4adc4bc
Galactica fallout

Flan-T5

https://twitter.com/quocleix/status/1583523186376785921
Flan-T5 is instruction-finetuned on 1,800+ language tasks, leading to dramatically improved prompting and multi-step reasoning abilities.
- 7 min summary video https://www.youtube.com/watch?v=oqi0QrbdgdI

Misc Text AI

OpenAI NarrativeQA Summarizing books https://openai.com/blog/summarizing-books/
