
How Rittman Analytics Automates Project RAG Status Reporting using Vertex AI, DocumentAI, BigQuery & Looker

A common way to report on the progress of a project is through RAG (Red, Amber, Green) status indicators, both at the overall project level and broken down by cost, schedule, scope, technology and data quality.

We’re currently piloting a Looker dashboard of this type to provide RAG status updates to our clients along with key project metrics and progress towards objectives, as shown in the screenshot below.

What’s interesting, and is the subject of this blog, is how we’re using generative AI, in the form of the Gemini LLM on Google’s Vertex AI platform, to assess each project engagement using a wide-ranging set of structured, semi-structured and unstructured data sources such as meeting transcripts, statements of work and project documentation.

Why RAG Status Reporting Is Hard to Automate

Creating a Looker dashboard that reports on the number of tickets closed, the amount of budget spent and other project metrics is relatively straightforward, but the numbers only tell part of the story.

What’s discussed in project meetings and Slack messages, and what’s written in project documentation, provides the additional colour and context a project manager uses when assigning RAG statuses to a project.

And of course all of those RAG statuses are set by reference to the project objectives, expected benefits and agreed deliverables contained within the project Statement of Work, typically a PDF document that’s easy for a project manager to read but usually impenetrable to a dashboarding tool.

To create our RAG status dashboard our approach was to source and centralize data from the following data sources and then use that data to prompt Vertex AI to classify the project engagement as either Red, Amber or Green RAG status:

  • Project metrics and KPI data from Harvest, Jira and Hubspot

  • Meeting transcripts from Fathom.video

  • Messages from Slack

  • Project documentation from Confluence Cloud

  • Statements of Work stored as PDFs and staged in Google Cloud Storage

In this blog we’ll focus on the unstructured and semi-structured data sources that were new to our internal BigQuery warehouse and are good examples of how you can use BQML functions to parse and analyze these types of non-conventional warehouse sources.

Let’s start with how we wrangled and added structure to our Fathom.video meeting transcripts, a source of project activity data that’s been key to this whole generative AI use-case.

Parsing & Structuring Fathom Meeting Transcripts

Like many organizations we’re big users of services such as Fathom.video that automatically transcribe meetings for you, and we’ve tried other services such as Fireflies.ai and Otter.ai in the past. Crucially for us, though, Fathom integrates with the Zapier service, enabling us to set up processing pipelines that can call other services, such as Google Vertex AI, and then store the final output in our Google BigQuery data warehouse.

When a new meeting transcript is published, we run the transcript’s raw text through a number of processing steps in a Zapier pipeline, analyzing the transcript’s overall sentiment, summarising it and then classifying the meeting as either an internal meeting or one of two client meeting types.

Each step in the pipeline sends a prompt to Vertex AI’s Gemini 1.5 Flash LLM asking it to analyze the sentiment of the whole transcript, summarise it and classify it as either a client or internal meeting, with the raw transcript and all of the LLM-generated additions then written to a table in our Google BigQuery warehouse.

Taking just the raw transcript for now, we then process it using a set of dbt transformations as shown in the dbt Cloud lineage view below:

Each raw transcript is structured in pairs of lines, the first line containing the time, speaker name and company name in brackets, and the following line containing the words that person said, as shown in the example below:
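For illustration, a hypothetical pair in this layout might look like the following (the speaker, company and timestamp are invented, and the exact Fathom formatting may differ):

```
0:01:42 - Jane Doe (Acme Corp)
Thanks everyone for joining. Shall we start with the status of this week's data loads?
```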

We then parse this semi-structured data with the dbt staging-layer model below, using BigQuery SQL to split the raw text field by newlines, group those lines together in pairs and then join them back together to return one row per transcript speaker line.

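The model below is a sketch of that approach rather than our production code; the source table, column names and the exact header-line format are assumptions:

```sql
-- stg_fathom__transcript_lines.sql (hypothetical dbt staging model)
with raw_lines as (

    select
        transcript_id,
        line,
        row_number() over (partition by transcript_id order by pos) as line_number
    from {{ ref('fathom_transcripts') }},
        unnest(split(raw_transcript, '\n')) as line with offset as pos

),

paired as (

    select
        transcript_id,
        -- group consecutive lines into header + spoken-words pairs
        div(line_number - 1, 2) as pair_number,
        max(if(mod(line_number, 2) = 1, line, null)) as header_line,
        max(if(mod(line_number, 2) = 0, line, null)) as spoken_words
    from raw_lines
    group by 1, 2

)

select
    transcript_id,
    pair_number,
    -- header lines assumed to look like "0:01:42 - Jane Doe (Acme Corp)"
    trim(regexp_extract(header_line, r'^([^-]+)')) as spoken_at,
    trim(regexp_extract(header_line, r'-([^(]+)\(')) as speaker_name,
    trim(regexp_extract(header_line, r'\(([^)]+)\)')) as company_name,
    spoken_words
from paired
```

Splitting on newlines and pairing by row number keeps the parsing purely set-based, so it runs as a single BigQuery query inside dbt.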

This parsing not only splits the meeting transcript into one row per line of each meeting attendee’s words, but also parses out the speaker names and their company names into their own table columns, which we use later on in the pipeline to join these meeting conversations back to our central warehouse “contacts” dimension.

And now that we have the transcript in this more structured form, we can further process and aggregate the meeting contributions from each person through a series of dbt transformations as shown in the diagram below.

This BigQuery-powered dbt pipeline creates for us:

  1. Per meeting attendee, a summary of what points they made in the meeting (their “contribution”) alongside a summary of the meeting as a whole, to help us understand the viewpoint, sentiment and engagement in that meeting for each attendee

  2. Using all of a meeting attendee’s contributions, aggregated into a single JSON string, we then use BigQuery’s ML.GENERATE_TEXT function to pass those contributions to an LLM and ask it to return a JSON-formatted list of “jobs-to-be-done” for that person, as sketched after this list.

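A minimal sketch of that ML.GENERATE_TEXT call, assuming a remote Gemini model and an attendee-level contributions table (both names are hypothetical):

```sql
SELECT
  attendee_name,
  ml_generate_text_llm_result AS jobs_to_be_done_json
FROM ML.GENERATE_TEXT(
  MODEL `project.dataset.gemini_flash`,
  (
    SELECT
      attendee_name,
      CONCAT(
        'From the meeting contributions below, return a JSON-formatted list ',
        'of jobs-to-be-done for this person, each with a "job" description ',
        'and a list of "keywords": ',
        contributions_json
      ) AS prompt
    FROM `project.dataset.meeting_attendee_contributions`
  ),
  STRUCT(0.2 AS temperature, TRUE AS flatten_json_output)
);
```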

The JSON-formatted results of this prompt response are then transformed into two BigQuery nested sets of columns, one containing their jobs-to-be-done and the other a list of keywords from those jobs, as shown in the example output below.

You can read about other use-cases we’ve developed for BigQuery’s ML.GENERATE_TEXT function in our blogs on Automating your Contacts List Segmentation using Google BigQuery, Vertex AI and the ML.GENERATE_TEXT function and Automating Insights and Dashboard Report-Level Narratives using Vertex AI Gemini 1.5-Flash, Looker Studio and Google BigQuery BQML.

Confluence Storage Format Parsing using Gen AI

Another key source of data for assessing the progress of a project is the documentation we create as meeting notes and the final engagement deliverables.

Fivetran have a Confluence Cloud connector that extracts into BigQuery all of the page and comments data we’re looking for, but the raw content it brings across is in Confluence’s XHTML-based Storage Format, which isn’t easily parsed by BigQuery’s JSON query functions; in addition, all of the content is versioned and so needs filtering to return only the most recent version of each page.

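One way to do that version filtering, with table and column names as assumptions rather than the connector’s actual schema:

```sql
-- keep only the latest version of each Confluence page
SELECT *
FROM `project.dataset.confluence_page`
WHERE TRUE  -- BigQuery requires a WHERE/GROUP BY/HAVING alongside QUALIFY
QUALIFY ROW_NUMBER() OVER (PARTITION BY page_id ORDER BY version DESC) = 1
```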

To parse this XHTML content we again used BigQuery’s ML.GENERATE_TEXT function, this time asking the Vertex AI LLM to first parse the XHTML-formatted data from Confluence and then summarise it, with those summaries used later on as part of the input into the RAG status generation dbt model.

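A sketch of that call, with the model and table names again being assumptions; note the REPLACE() calls around the LLM output, explained below:

```sql
-- parse and summarise Confluence Storage Format content via the LLM
SELECT
  page_id,
  JSON_VALUE(
    -- strip the ```json fences and linebreaks the LLM adds (see below)
    REPLACE(REPLACE(REPLACE(ml_generate_text_llm_result, '```json', ''),
                    '```', ''), '\n', ''),
    '$.summary'
  ) AS page_summary
FROM ML.GENERATE_TEXT(
  MODEL `project.dataset.gemini_flash`,
  (
    SELECT
      page_id,
      CONCAT(
        'Parse the following Confluence Storage Format (XHTML) page content ',
        'and return a JSON object with a single "summary" field summarising ',
        'the page: ',
        page_body
      ) AS prompt
    FROM `project.dataset.confluence_latest_pages`
  ),
  STRUCT(TRUE AS flatten_json_output)
);
```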

If you’re wondering, the various REPLACE() functions around the output are to get rid of the extraneous ```json and linebreaks added by the LLM that would otherwise stop BigQuery’s JSON_VALUE function from working.

Statement of Work PDF Parsing using DocumentAI

To usefully assess the current status and progress of a client project, we really need to know the original requirements, business objectives and agreed deliverables for that project. This information can be found in a project-specific document called a “Statement of Work” (SoW): a free-form, loosely-structured document stored as a PDF.

Our strategy to access this statement of work data is to load it as embeddings into a vector storage table and then have ML.GENERATE_TEXT access it as a RAG (Retrieval Augmented Generation, another type of “RAG”) semantic search source, using two more BQML functions, ML.PROCESS_DOCUMENT and ML.GENERATE_EMBEDDING.

The pipeline is implemented as two Python notebooks: one that is run once to set the pipeline up (code example here), and a second that loops through all of the PDFs in Google Cloud Storage and, for each one that has not yet been ingested, runs through the following steps (code example here):

  1. Define a BigQuery Object Table over each SoW PDF file using the CREATE EXTERNAL TABLE DDL command

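A hedged example of that DDL, with the connection name, bucket and file path as placeholder assumptions:

```sql
-- object table exposing the SoW PDF(s) in GCS to BigQuery
CREATE OR REPLACE EXTERNAL TABLE `project.dataset.sow_pdf_object_table`
WITH CONNECTION `project.eu.vertex_ai_connection`
OPTIONS (
  object_metadata = 'SIMPLE',
  uris = ['gs://example-bucket/sows/*.pdf']
);
```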

  2. Parse, chunk and create embeddings for the PDF data using the BQML ML.PROCESS_DOCUMENT function and the DocumentAI Layout Parser processor (note that if, like us, you’re based in the UK and use the europe-west2 Google Cloud Platform region, this feature is only available in the EU and US multi-regions, so you’ll have to parse the documents in the EU multi-region and then copy the final output table over to europe-west2 at the end)

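A sketch of the ML.PROCESS_DOCUMENT call, assuming a remote model created over a Layout Parser processor (names hypothetical):

```sql
-- parse and chunk each PDF; run in the EU multi-region (see note above)
CREATE OR REPLACE TABLE `project.dataset.sow_parsed_chunks` AS
SELECT *
FROM ML.PROCESS_DOCUMENT(
  MODEL `project.dataset.layout_parser`,
  TABLE `project.dataset.sow_pdf_object_table`,
  PROCESS_OPTIONS =>
    (JSON '{"layout_config": {"chunking_config": {"chunk_size": 250}}}')
);
```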

  3. Generate vector embeddings for each of those parsed results tables using the ML.GENERATE_EMBEDDING function

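For example, assuming a remote text embedding model and a content column holding each parsed chunk (again, all names are assumptions):

```sql
-- embed each parsed SoW chunk ready for vector search
CREATE OR REPLACE TABLE `project.dataset.sow_embeddings` AS
SELECT *
FROM ML.GENERATE_EMBEDDING(
  MODEL `project.dataset.text_embedding`,
  (
    SELECT sow_file_name, chunk_id, content
    FROM `project.dataset.sow_parsed_chunks`
  )
);
```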

  4. And finally, use the ML.GENERATE_TEXT function to create a prompt that uses the vector table as its RAG search source and outputs, in a structured form, the details in the engagement Statement of Work.

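A sketch of that final retrieval-and-generation step, pairing VECTOR_SEARCH with ML.GENERATE_TEXT (the model, table and column names are assumptions):

```sql
WITH question AS (
  -- embed the retrieval question using the same embedding model
  SELECT ml_generate_embedding_result
  FROM ML.GENERATE_EMBEDDING(
    MODEL `project.dataset.text_embedding`,
    (SELECT 'project objectives, expected benefits and agreed deliverables' AS content)
  )
),

relevant_chunks AS (
  -- fetch the ten closest SoW chunks to use as RAG context
  SELECT STRING_AGG(base.content, '\n') AS context
  FROM VECTOR_SEARCH(
    TABLE `project.dataset.sow_embeddings`,
    'ml_generate_embedding_result',
    (SELECT ml_generate_embedding_result FROM question),
    top_k => 10
  )
)

SELECT ml_generate_text_llm_result AS sow_details_json
FROM ML.GENERATE_TEXT(
  MODEL `project.dataset.gemini_flash`,
  (
    SELECT CONCAT(
      'Using only the Statement of Work excerpts below, return a JSON object ',
      'with the background, requirements, solution, objectives (each with an ',
      'expected business benefit and measurement metric) and deliverables ',
      '(each with acceptance criteria): ',
      context
    ) AS prompt
    FROM relevant_chunks
  ),
  STRUCT(8192 AS max_output_tokens, TRUE AS flatten_json_output)
);
```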

Looking now at the results of this processing pipeline you can see one row per engagement SoW PDF with the background, requirements, solution and then nested column sets for the objectives (including expected business benefit and measurement metric) and the contractual deliverables (with acceptance criteria).

In total, including the pipelines to bring in Slack messaging data and the various project KPI data sources such as Jira, Harvest and Hubspot, the complete processing pipeline for assessing the project engagement’s RAG status is shown in the diagram below.

Using ML.GENERATE_TEXT to Classify RAG Status

The final step in this process is to take all of the data gathered from the various project sources, concatenate the multi-row datasets they provide into JSON strings suitable for sending as part of an LLM prompt, and then use them to create a series of prompts to the Gemini 1.5 Flash LLM using BigQuery’s ML.GENERATE_TEXT function.

The full SQL query can be accessed as sample code here, but the key part of the command is where we concatenate all of these data sources into a prompt we send to the LLM, shown below:

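The sketch below shows the shape of that prompt assembly; the rollup table, its JSON-aggregated source columns and the prompt wording are illustrative assumptions rather than our production SQL:

```sql
-- classify each engagement's RAG status from the concatenated source data
SELECT
  engagement_id,
  ml_generate_text_llm_result AS rag_status_json
FROM ML.GENERATE_TEXT(
  MODEL `project.dataset.gemini_flash`,
  (
    SELECT
      engagement_id,
      CONCAT(
        'Acting as a project manager, classify this engagement as Red, Amber ',
        'or Green overall and for cost, schedule, scope, technology and data ',
        'quality, returning JSON with a status and short rationale for each. ',
        'Statement of Work details: ', sow_details_json,
        '; Project KPIs: ', project_kpis_json,
        '; Meeting summaries: ', meeting_summaries_json,
        '; Slack messages: ', slack_messages_json,
        '; Documentation summaries: ', confluence_summaries_json
      ) AS prompt
    FROM `project.dataset.project_engagement_rollup`
  ),
  STRUCT(8192 AS max_output_tokens, TRUE AS flatten_json_output)
);
```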

The output of this SQL, run within our warehouse dbt package every six hours and stored as a monthly snapshot of each project engagement’s performance, is then used as the main data source for our Project RAG Status dashboard delivered through Looker as shown below.

Note also the visualizations in the bottom half of the dashboard page, where we use the sentiment and engagement levels calculated elsewhere in this pipeline to gauge the “mood” of the project at this point in time, and the wordcloud visualization at the bottom right-hand side of the page that shows, in visual form, the topics most prominent in the various stakeholders’ jobs-to-be-done, also extracted and assessed via LLM in this processing pipeline.

INTERESTED? FIND OUT MORE!

Rittman Analytics is a boutique data analytics consultancy that helps ambitious, digital-native businesses scale up their approach to data, analytics and generative AI.

We’re authorised delivery partners for Google Cloud along with Oracle, Segment, Cube, Dagster, Preset, dbt Labs and Fivetran, and are experts at helping you design an analytics solution that’s right for your organisation’s needs, use-cases and budget, and at working with you and your data team to successfully implement it.

If you’re looking for some help and assistance with your AI initiative or would just like to talk shop and share ideas and thoughts on what’s going on in your organisation and the wider data analytics world, contact us now to organise a 100%-free, no-obligation call — we’d love to hear from you!