A Powerful Generic TypeScript Function for Generating Valuable RAG Texts
In a Nutshell: Given X, what’s the probability of Y occurring?
Posted on August 20, 2023
The full code example used in this post is on TypeScript Playground.
Generating Deterministic Statistics for LLMs That Are Bad at Deterministic Things
I’m currently hard at work on my latest SaaS product, AMT JOY, a historical, statistical, and probabilistic tool covering futures trading sessions, with a focus on Auction Market Theory (AMT). Part of the product we’re calling “AMT GPT”: a chat-based tool where you can ask for information about any historical session that ever occurred, with the eventual goal of planning intraday plays in real time. This is where I discovered embeddings and Retrieval Augmented Generation (RAG). Skeptical as always about the accuracy of these LLMs, I fully expected poor results, and that’s exactly what I saw at first. For example, I asked a very simple “What trading session had the highest return?” and got a completely different answer each time! That’s when I realized:
This was completely my fault.
I had to first inform myself about how RAG works. Under the hood, RAG first retrieves a small number of matching documents, then generates a response from them. Since I only had per-day session descriptions, where the session’s return is only a small part of each file, the initial retrieval was naturally unreliable, and the LLM could only report on whatever sessions happened to be retrieved. As soon as I included an “overall stats” file, however, it started working much more reliably. That stats file literally contained the line “The session with the highest return occurred on …”, and that is exactly what the retrieval step was finding in subsequent queries. That’s when I realized that to make RAG work for a wide variety of statistical and probabilistic questions, I needed mounds and mounds of my data converted to human-readable, natural-language documents. Crucially, not randomly generated filler, but highly structured and statistically accurate data. Only then would AMT GPT start to be useful.
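To make that retrieve-then-generate flow concrete, here is a minimal sketch in TypeScript. The embed and complete parameters are hypothetical stand-ins for whatever embedding model and LLM you use; only the cosine-similarity ranking is concrete:

type Embedder = (text: string) => Promise<number[]>
type Completer = (prompt: string) => Promise<string>

// Cosine similarity between two embedding vectors of equal length.
const cosineSimilarity = (a: number[], b: number[]): number => {
    let dot = 0
    let normA = 0
    let normB = 0
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// RAG in a nutshell: embed the question, retrieve the k most similar
// documents, and generate an answer grounded only in those documents.
const answerWithRAG = async (
    question: string,
    documents: { text: string; embedding: number[] }[],
    embed: Embedder, // hypothetical: your embedding model of choice
    complete: Completer, // hypothetical: your LLM of choice
    k = 3
): Promise<string> => {
    const queryEmbedding = await embed(question)
    const context = documents
        .map((doc) => ({
            text: doc.text,
            score: cosineSimilarity(queryEmbedding, doc.embedding),
        }))
        .sort((a, b) => b.score - a.score)
        .slice(0, k)
        .map((doc) => doc.text)
        .join("\n---\n")
    return complete(`Context:\n${context}\n\nQuestion: ${question}`)
}

If retrieval hands the LLM the wrong documents, no amount of prompting will fix the answer, which is exactly what I was seeing.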
Generating the Data
At first, I started with a sort of alpha version, using string interpolation to format various files, like my ‘stats’ file:
...
return `# Stats overview Data for ${data[0].symbol}
The session with the highest return was ${sessionWithHighestReturn.formattedDate} (${sessionWithHighestReturn.sessionId}) with a return of ${sessionWithHighestReturn.sessionReturn.toFixed(2)}.
The session with the lowest return was ${sessionWithLowestReturn.formattedDate} (${sessionWithLowestReturn.sessionId}) with a return of ${sessionWithLowestReturn.sessionReturn.toFixed(2)}.
The session with the highest volume was ${sessionWithHighestVolume.formattedDate} (${sessionWithHighestVolume.sessionId}) with a volume of ${sessionWithHighestVolume.totalVolume}.
The session with the lowest volume was ${sessionWithLowestVolume.formattedDate} (${sessionWithLowestVolume.sessionId}) with a volume of ${sessionWithLowestVolume.totalVolume}.
The session with the highest A period return was ${sessionWithHighestAPeriodReturn.formattedDate} (${sessionWithHighestAPeriodReturn.sessionId}) with a return of ${sessionWithHighestAPeriodReturn.candles[0].r.toFixed(2)}.
The session with the lowest A period return was ${sessionWithLowestAPeriodReturn.formattedDate} (${sessionWithLowestAPeriodReturn.sessionId}) with a return of ${sessionWithLowestAPeriodReturn.candles[0].r.toFixed(2)}.`;
};
...
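The elided portions above select the extreme sessions from the data. Purely as an illustration (the real selection logic lives in the linked Playground), a hypothetical generic helper like maxBy could produce them, assuming data is an array of the ISessionStats shape shown in the next section:

// Pick the element of items that maximizes selector.
// Hypothetical helper; assumes items is non-empty.
const maxBy = <T>(items: T[], selector: (item: T) => number): T =>
    items.reduce((best, item) =>
        selector(item) > selector(best) ? item : best
    )

// e.g.:
// const sessionWithHighestReturn = maxBy(data, (s) => s.sessionReturn)
// const sessionWithHighestVolume = maxBy(data, (s) => s.totalVolume)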
I realized, however, that far more correlations were possible, and since potentially thousands of traders will use AMT JOY, a correlation I might consider useless or unnecessary, or might not even think of at all, may in fact be an essential part of another trader’s strategy! Just look at our ISessionStats interface and consider how many “Given X … then Y” correlations are possible:
interface ISessionStats {
    index: number
    symbol: string
    sessionId: string
    formattedDate: string
    dayOfWeek: string
    month: string
    year: number
    candles: Candle[]
    dailyOTF: string
    weeklyOTF: string
    monthlyOTF: string
    orTrend: string
    orExtensionUp: string
    orExtensionDown: string
    ibTrend: string
    trTrend: string
    ibExtensionUp: string
    ibExtensionDown: string
    sessionType: string
    sessionAnnotation: string
    openingRange: Range
    initialBalance: Range
    tradingRange: Range
    vwap: VWAPWithStdDev[]
    timeAboveOR: number
    timeInOR: number
    timeBelowOR: number
    timeAboveIB: number
    timeInIB: number
    timeBelowIB: number
    timeAboveTR: number
    timeInTR: number
    timeBelowTR: number
    timeAboveSD1: number
    timeWithinSD1: number
    timeBelowSD1: number
    timeAboveSD2: number
    timeWithinSD2: number
    timeBelowSD2: number
    timeAboveSD3: number
    timeWithinSD3: number
    timeBelowSD3: number
    timeAboveSD4: number
    timeWithinSD4: number
    timeBelowSD4: number
    vwapCrossesUp: number
    vwapCrossesDown: number
    closeInRelationToOR: string
    closeInRelationToIB: string
    candleClosesBelowBelowORMidpoint: number
    candleClosesAboveAboveORMidpoint: number
    candleClosesBelowBelowIBMidpoint: number
    candleClosesAboveAboveIBMidpoint: number
    crossesORHighPeriodUp: string[]
    crossesORHighPeriodDown: string[]
    crossesORLowPeriodDown: string[]
    crossesORLowPeriodUp: string[]
    crossesIBHighPeriodUp: string[]
    crossesIBHighPeriodDown: string[]
    crossesIBHighPeriodDownCount: number
    crossesIBHighPeriodUpCount: number
    crossesIBLowPeriodDown: string[]
    crossesIBLowPeriodUp: string[]
    crossesIBLowPeriodUpCount: number
    crossesIBLowPeriodDownCount: number
    openingDrive: string
    gapName: string
    gapPercent: number
    gapFillLevel: number
    gapFilled: string
    gapFillPeriod: string
    sessionReturn: number
    totalVolume: number
    totalVolumePercentile: number
    ibHighFromOpenPercentChange: number
    ibLowFromOpenPercentChange: number
    lowestLevelFromOpenPercentChange: number
    lowestLevelPeriod: string
    highestLevelFromOpenPercentChange: number
    highestLevelPeriod: string
    trueRange: number
    averageTrueRange: number
    aPeriodTrend: string
    bPeriodTrend: string
    cPeriodTrend: string
    dPeriodTrend: string
    ePeriodTrend: string
    fPeriodTrend: string
    gPeriodTrend: string
    hPeriodTrend: string
    iPeriodTrend: string
    jPeriodTrend: string
    kPeriodTrend: string
    lPeriodTrend: string
    mPeriodTrend: string
    phod: number
    plod: number
    prevClose: number
    mostSimilarSessionsByReturn: Array<ISessionCorrelation>
}
In any case, a generic, or at the very least abstract, approach is required to efficiently and effectively generate as many statistics-based files as possible. The more combinations of stats we can produce, the more powerful our RAG.
Given This, Then…
The first function I wrote takes an array of your objects (of any type: that’s the power of generics!) and generates “given X, the probability of Y occurring is Z” style sentences. There are two parts: first, we calculate the standalone distribution of each metric, counting every distinct value it takes; only then can we combine metrics to build all our Given / Then sentences.
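Both functions rely on two small supporting interfaces, IProbabilityMetric and IPropertyStats, which aren’t shown in the post. Minimal definitions inferred from how the code uses them (the originals are in the linked Playground) would look like this:

// A metric pairs a human-readable label with the object property it describes.
interface IProbabilityMetric<T> {
    label: string
    property: keyof T
}

// Accumulated counts across all objects: every distinct value seen for each
// property, plus how many times each value occurred.
interface IPropertyStats<T> {
    totalCount: number
    uniqueValues: Map<keyof T, Set<T[keyof T]>>
    valueCounts: Map<keyof T, Map<T[keyof T], number>>
}

With those in place, the counting function looks like this: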
const calculatePropertyStats = <T>(
    objects: T[],
    metrics: IProbabilityMetric<T>[]
): IPropertyStats<T> => {
    const propertyStats: IPropertyStats<T> = {
        totalCount: 0,
        uniqueValues: new Map<keyof T, Set<T[keyof T]>>(),
        valueCounts: new Map<keyof T, Map<T[keyof T], number>>(),
    }
    for (const obj of objects) {
        propertyStats.totalCount++
        for (const metric of metrics) {
            const propertyKey = metric.property
            const propertyValue = obj[propertyKey]
            // Lazily initialize the containers for this property
            if (!propertyStats.uniqueValues.has(propertyKey)) {
                propertyStats.uniqueValues.set(propertyKey, new Set<T[keyof T]>())
                propertyStats.valueCounts.set(propertyKey, new Map<T[keyof T], number>())
            }
            // Track every distinct value seen for this property
            propertyStats.uniqueValues.get(propertyKey)!.add(propertyValue)
            // Increment the occurrence count for this particular value
            const counts = propertyStats.valueCounts.get(propertyKey)!
            counts.set(propertyValue, (counts.get(propertyValue) ?? 0) + 1)
        }
    }
    return propertyStats
}
Leveraging this function, we can build the parent function, generateProbabilitySentences:
const generateProbabilitySentences = <T>(
    objects: T[],
    metrics: IProbabilityMetric<T>[]
): string[] => {
    const propertyKeys = metrics.map((metric) => metric.property)
    const propertyStats: IPropertyStats<T> = calculatePropertyStats(
        objects,
        metrics
    )
    const sentences: string[] = []
    for (const n1PropertyKey of propertyKeys) {
        for (const n1Value of propertyStats.uniqueValues.get(n1PropertyKey)!) {
            for (const n2PropertyKey of propertyKeys) {
                for (const n2Value of propertyStats.uniqueValues.get(
                    n2PropertyKey
                )!) {
                    // P(n2 = n2Value | n1 = n1Value) =
                    //   count(n1Value AND n2Value) / count(n1Value)
                    const n1ValueCount = propertyStats.valueCounts
                        .get(n1PropertyKey)!
                        .get(n1Value)!
                    const intersectionCount = objects.filter(
                        (obj) =>
                            obj[n1PropertyKey] === n1Value &&
                            obj[n2PropertyKey] === n2Value
                    ).length
                    const probability =
                        n1ValueCount > 0
                            ? ((intersectionCount / n1ValueCount) * 100).toFixed(2)
                            : "0.00"
                    const label1 = metrics.find(
                        (metric) => metric.property === n1PropertyKey
                    )?.label
                    const label2 = metrics.find(
                        (metric) => metric.property === n2PropertyKey
                    )?.label
                    const propIndex1 = propertyKeys.indexOf(n1PropertyKey)
                    const propIndex2 = propertyKeys.indexOf(n2PropertyKey)
                    // Emit each given / then pair only once, in metric order,
                    // and never pair a property with itself
                    if (label1 !== label2 && propIndex1 < propIndex2) {
                        sentences.push(
                            `Given ${label1} is ${n1Value}, the probability of ${label2} ${n2Value} is ${probability}%.`
                        )
                    }
                }
            }
        }
    }
    return sentences
}
Example
Observe the following example data shape, ISalesData:
interface ISalesData {
    dayOfWeek: string
    productSold: string
    usedCoupon: boolean
}
and the data, salesData:
const salesData: ISalesData[] = [
    { dayOfWeek: "Monday", productSold: "Product A", usedCoupon: true },
    { dayOfWeek: "Tuesday", productSold: "Product B", usedCoupon: false },
    { dayOfWeek: "Wednesday", productSold: "Product C", usedCoupon: true },
    { dayOfWeek: "Thursday", productSold: "Product A", usedCoupon: false },
    { dayOfWeek: "Friday", productSold: "Product B", usedCoupon: true },
    { dayOfWeek: "Saturday", productSold: "Product C", usedCoupon: false },
    { dayOfWeek: "Sunday", productSold: "Product A", usedCoupon: true },
    { dayOfWeek: "Monday", productSold: "Product B", usedCoupon: false },
    { dayOfWeek: "Tuesday", productSold: "Product C", usedCoupon: true },
    { dayOfWeek: "Wednesday", productSold: "Product A", usedCoupon: false },
    { dayOfWeek: "Thursday", productSold: "Product B", usedCoupon: true },
    { dayOfWeek: "Friday", productSold: "Product C", usedCoupon: false },
    { dayOfWeek: "Saturday", productSold: "Product A", usedCoupon: true },
    { dayOfWeek: "Sunday", productSold: "Product B", usedCoupon: false },
    { dayOfWeek: "Monday", productSold: "Product C", usedCoupon: true },
    { dayOfWeek: "Tuesday", productSold: "Product A", usedCoupon: false },
    { dayOfWeek: "Wednesday", productSold: "Product B", usedCoupon: true },
    { dayOfWeek: "Thursday", productSold: "Product C", usedCoupon: false },
    { dayOfWeek: "Friday", productSold: "Product A", usedCoupon: true },
    { dayOfWeek: "Saturday", productSold: "Product B", usedCoupon: false },
]
What if we want to know the probability that a customer buys a certain product on a given day? Or the chance they used a coupon on a certain day? Easy: we just pass each of these given / then scenarios into generateProbabilitySentences:
// day of week and coupon
let metrics: IProbabilityMetric<ISalesData>[] = [
    {
        label: "The day of week",
        property: "dayOfWeek",
    },
    {
        label: "the customer using a coupon being",
        property: "usedCoupon",
    },
]
let sentences = generateProbabilitySentences(salesData, metrics)
// log all sentences to the console
sentences.forEach((sentence) => console.log(sentence))

// day of week and product
metrics = [
    {
        label: "The day of week",
        property: "dayOfWeek",
    },
    {
        label: "the customer purchasing the product",
        property: "productSold",
    },
]
sentences = generateProbabilitySentences(salesData, metrics)
// log all sentences to the console
sentences.forEach((sentence) => console.log(sentence))
and… drumroll please… our amazing output:
Given the day of week is Monday, the probability of the customer using a coupon being true is 66.67%.
Given the day of week is Monday, the probability of the customer using a coupon being false is 33.33%.
Given the day of week is Tuesday, the probability of the customer using a coupon being true is 33.33%.
Given the day of week is Tuesday, the probability of the customer using a coupon being false is 66.67%.
Given the day of week is Wednesday, the probability of the customer using a coupon being true is 66.67%.
Given the day of week is Wednesday, the probability of the customer using a coupon being false is 33.33%.
Given the day of week is Thursday, the probability of the customer using a coupon being true is 33.33%.
Given the day of week is Thursday, the probability of the customer using a coupon being false is 66.67%.
Given the day of week is Friday, the probability of the customer using a coupon being true is 66.67%.
Given the day of week is Friday, the probability of the customer using a coupon being false is 33.33%.
Given the day of week is Saturday, the probability of the customer using a coupon being true is 33.33%.
Given the day of week is Saturday, the probability of the customer using a coupon being false is 66.67%.
Given the day of week is Sunday, the probability of the customer using a coupon being true is 50.00%.
Given the day of week is Sunday, the probability of the customer using a coupon being false is 50.00%.
Given the day of week is Monday, the probability of the customer purchasing the product Product A is 33.33%.
Given the day of week is Monday, the probability of the customer purchasing the product Product B is 33.33%.
Given the day of week is Monday, the probability of the customer purchasing the product Product C is 33.33%.
Given the day of week is Tuesday, the probability of the customer purchasing the product Product A is 33.33%.
Given the day of week is Tuesday, the probability of the customer purchasing the product Product B is 33.33%.
Given the day of week is Tuesday, the probability of the customer purchasing the product Product C is 33.33%.
Given the day of week is Wednesday, the probability of the customer purchasing the product Product A is 33.33%.
Given the day of week is Wednesday, the probability of the customer purchasing the product Product B is 33.33%.
Given the day of week is Wednesday, the probability of the customer purchasing the product Product C is 33.33%.
Given the day of week is Thursday, the probability of the customer purchasing the product Product A is 33.33%.
Given the day of week is Thursday, the probability of the customer purchasing the product Product B is 33.33%.
Given the day of week is Thursday, the probability of the customer purchasing the product Product C is 33.33%.
Given the day of week is Friday, the probability of the customer purchasing the product Product A is 33.33%.
Given the day of week is Friday, the probability of the customer purchasing the product Product B is 33.33%.
Given the day of week is Friday, the probability of the customer purchasing the product Product C is 33.33%.
Given the day of week is Saturday, the probability of the customer purchasing the product Product A is 33.33%.
Given the day of week is Saturday, the probability of the customer purchasing the product Product B is 33.33%.
Given the day of week is Saturday, the probability of the customer purchasing the product Product C is 33.33%.
Given the day of week is Sunday, the probability of the customer purchasing the product Product A is 50.00%.
Given the day of week is Sunday, the probability of the customer purchasing the product Product B is 50.00%.
Given the day of week is Sunday, the probability of the customer purchasing the product Product C is 0.00%.
I can guarantee the statistics are accurate :)
You could then write sentences like these into a document, embed it into your favorite vector store (like Pinecone) with the embedding model of your choice, and query against it with the LLM of your choice (like GPT-4). You can then be sure your queries will be matched with accurate, not hallucinated, data.
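As a sketch of what that ingestion step might look like, here the embed function and the vector-store client are hypothetical stand-ins, not any specific provider’s API:

type Embedder = (text: string) => Promise<number[]>

// Hypothetical vector store client; swap in your real provider (e.g. Pinecone).
interface IVectorStore {
    upsert(
        records: { id: string; values: number[]; metadata: { text: string } }[]
    ): Promise<void>
}

// Embed each generated sentence and upsert it with its raw text as metadata,
// so retrieval can hand the exact sentence back to the LLM.
const ingestSentences = async (
    sentences: string[],
    embed: Embedder,
    vectorStore: IVectorStore
): Promise<void> => {
    const records = await Promise.all(
        sentences.map(async (text, i) => ({
            id: `sentence-${i}`,
            values: await embed(text),
            metadata: { text },
        }))
    )
    await vectorStore.upsert(records)
}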
Of course, this is a toy example, and you can see that this function works best with properties that are string enums, i.e. a fixed list of countable values. You would need to define additional rules for things like sums, time frames (“show me all sales for June / July / August”), averages, max, or min, and this is exactly what we’re working on!
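To give a taste of what one such rule might look like, here is a hypothetical numeric counterpart to IProbabilityMetric (a sketch, not the actual AMT JOY implementation) that generates average / min / max sentences:

// Hypothetical numeric metric: a label plus a property holding a number.
interface INumericMetric<T> {
    label: string
    property: keyof T
}

// Generate "The average X is Y" style sentences for numeric properties.
const generateAggregateSentences = <T>(
    objects: T[],
    metrics: INumericMetric<T>[]
): string[] =>
    metrics.map((metric) => {
        const values = objects.map((obj) => Number(obj[metric.property]))
        const sum = values.reduce((acc, v) => acc + v, 0)
        const avg = sum / values.length
        return (
            `The average ${metric.label} is ${avg.toFixed(2)}, ` +
            `the minimum is ${Math.min(...values).toFixed(2)}, ` +
            `and the maximum is ${Math.max(...values).toFixed(2)}.`
        )
    })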
More Coming!
In the coming weeks and months, I’m looking to scale this tool out into a fully separate product. Essentially, you’ll be able to put in your organization’s data, whatever shape it takes, and we’ll generate every probability and statistic we can from it; you can then use RAG with any LLM of your choice to extract mathematically true values from that data. It replaces the need for heavyweight big-data analysis, and it operates through an extremely human-like interface. Think of it like a friendly human assistant that has memorized the entirety of your org’s data set, including probabilities and statistics that you may not even have thought of yourself!
Also, if you’ve gotten this far: do you know of any tools working on similar problems? Or what this field of technology is called? I’d love to research it.
Thanks & Cheers
Cheers,
-Chris