Dave: We've worked across many different industries, and we've consistently run into data quality hurdles, whether it's building profile matching systems, getting real-time transcription to actually work properly, making sure our automated assessments are consistent, validating information across massive content libraries, or implementing evaluation frameworks for continuous improvement and governance. Could you walk me through some of the technical approaches we're using to ensure data quality and model accuracy? We deal with such different contexts and applications, so what specific methodologies have proven most effective in your experience?

Justin: Evaluation is one of the key techniques we use to ensure data quality across both inputs and outputs - assessing the data against a number of different characteristics.

As part of our development process, we conduct an initial data quality assessment, which examines all the different data sources that will feed into an application and later be used as input to the models. We typically look at the consistency and integrity of the data. Whether it's written or numerical content, has it been filled in and appropriately validated, and does each record have a consistent amount of data across the fields you'll need to use? On a fundamental level, it's about asking the question: Is it fit for purpose?
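
As an illustration of the fitness-for-purpose checks described here, below is a minimal sketch in Python (using pandas) of completeness and consistency reporting. The field names and validation rules are hypothetical, not taken from any actual client dataset.

```python
import pandas as pd

def assess_completeness(df: pd.DataFrame, required_fields: list[str]) -> pd.DataFrame:
    """Report, per required field, how many records are missing a value."""
    return pd.DataFrame({
        "missing_count": df[required_fields].isna().sum(),
        "missing_pct": df[required_fields].isna().mean().round(3),
    }).sort_values("missing_pct", ascending=False)

def assess_consistency(df: pd.DataFrame, rules: dict) -> dict:
    """Count records that violate simple per-field validation rules.

    `rules` maps a column name to a predicate that returns True for valid values.
    """
    return {
        col: int((~df[col].dropna().map(predicate)).sum())
        for col, predicate in rules.items()
    }

# Example usage with hypothetical fields
records = pd.DataFrame({
    "respondent_id": [1, 2, 3, None],
    "age": [34, -1, 29, 41],
    "free_text": ["fine", "", "good", "ok"],
})
print(assess_completeness(records, ["respondent_id", "age", "free_text"]))
print(assess_consistency(records, {"age": lambda a: 0 <= a <= 120}))
```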

Then there are the governance and compliance characteristics. Is there evidence that the data has been collected in line with GDPR, and has it been anonymised in line with regulations like HIPAA, depending on the use case? Some organisations are quite advanced in this area, while others are woefully behind. Quite often, there's no evidence of the data's provenance - where it's come from and what permissions have been given for it.

Increasingly, as companies pull more data out of their data lakes and start surfacing it in AI applications, they need to be aware that the provenance and governance of that data really matter. The onus lies with the client as the data owner, but as data processors ourselves, we must flag any issues and ensure the data is compliant for its intended uses.

Once you've completed the initial stages, you then start evaluating the quality of the data as input to your process. That's where you begin doing initial statistical testing for things like gender bias, inappropriate language, and so on. There are numerous tools we use for this. It may be a simple distribution analysis that counts potentially biased words in a dataset to give an indication of its overall bias.
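
A minimal sketch of that kind of distribution analysis, counting gendered terms across a corpus. The term lists here are purely illustrative; a real check would use a curated lexicon.

```python
import re
from collections import Counter

# Illustrative term lists only - a production check would use a curated lexicon.
MALE_TERMS = {"he", "him", "his", "man", "men", "male"}
FEMALE_TERMS = {"she", "her", "hers", "woman", "women", "female"}

def gender_term_distribution(documents: list[str]) -> dict:
    """Count gendered terms across a corpus and report the overall skew."""
    counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z']+", doc.lower())
        counts["male"] += sum(t in MALE_TERMS for t in tokens)
        counts["female"] += sum(t in FEMALE_TERMS for t in tokens)
    total = counts["male"] + counts["female"]
    male_share = counts["male"] / total if total else 0.5
    return {"male": counts["male"], "female": counts["female"], "male_share": round(male_share, 3)}

print(gender_term_distribution([
    "He led the team and his results were strong.",
    "She reviewed the findings with her colleagues.",
]))
```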

We can also test at an embedding level. Once the data has been ingested and given embeddings by passing it through a model, those embeddings can potentially introduce bias if there is any bias in the underlying model - it may have mapped more male-influenced terms, for example, to the data than is appropriate. You also need to check at that stage, again using distribution models.
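
One way to run that check at the embedding level is to compare each document vector against anchor directions built from gendered terms and see which way the corpus leans. The sketch below assumes you already have embeddings from whichever model was used at ingestion; the random vectors stand in for real embeddings and anchors.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def embedding_gender_skew(
    doc_vectors: np.ndarray,      # shape (n_docs, dim), from the ingestion model
    male_anchor: np.ndarray,      # e.g. mean embedding of male-associated terms
    female_anchor: np.ndarray,    # e.g. mean embedding of female-associated terms
) -> dict:
    """Compare each document embedding to the two anchor directions and
    summarise how the corpus leans overall."""
    male_sim = np.array([cosine(v, male_anchor) for v in doc_vectors])
    female_sim = np.array([cosine(v, female_anchor) for v in doc_vectors])
    lean = male_sim - female_sim
    return {
        "mean_lean": float(lean.mean()),            # > 0 means closer to the male anchor on average
        "pct_leaning_male": float((lean > 0).mean()),
    }

# Toy usage with random vectors standing in for real embeddings
rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 384))
print(embedding_gender_skew(docs, rng.normal(size=384), rng.normal(size=384)))
```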

At this point, if you suspect inherent bias, you're looking at introducing human evaluation - going back to clients to make sure they understand the risks, then literally having humans review and label that data so it can be cleaned up and re-ingested before you proceed. There are several automated tools available to assist with this. IBM developed a library years ago that's often used as the underlying bias checker for inputs and outputs.
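
The IBM library referred to is most likely AI Fairness 360, which packages standard fairness metrics. The sketch below computes one of those metrics, disparate impact, directly in pandas so the idea is visible without depending on a particular library; the data, column names, and the common 0.8 rule of thumb are illustrative.

```python
import pandas as pd

def disparate_impact(df: pd.DataFrame, group_col: str, outcome_col: str,
                     privileged, unprivileged) -> float:
    """Ratio of favourable-outcome rates: unprivileged group / privileged group.
    Values well below 1.0 (commonly < 0.8) are a warning sign."""
    rate_unpriv = df.loc[df[group_col] == unprivileged, outcome_col].mean()
    rate_priv = df.loc[df[group_col] == privileged, outcome_col].mean()
    return rate_unpriv / rate_priv

# Hypothetical labelled data: 1 = favourable outcome (e.g. shortlisted)
data = pd.DataFrame({
    "gender": ["m", "m", "m", "f", "f", "f", "f", "m"],
    "shortlisted": [1, 1, 0, 0, 1, 0, 0, 1],
})
print(round(disparate_impact(data, "gender", "shortlisted", "m", "f"), 3))
```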

Dave: You've talked about the importance of validating data consistency, checking governance compliance, and testing for bias at multiple stages. I'm interested in how these challenges manifest differently across our various client sectors. When we look at industries like healthcare with HIPAA concerns, recruitment with their varied survey approaches, or intelligence services with real-time processing needs, what are the main challenges we typically encounter when maintaining data integrity and mitigating bias across these varied datasets?

Justin: The challenge often lies in the way data is collected in the first place. A recruitment client of ours, for example, asks similar questions in different ways across various surveys. When you come to use that data, you've got very similar questions and answers, but it's hard to establish any consistency. It's difficult to determine whether bias has been introduced by changes in a question's tone or in how the answers were presented.
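
One way to start surfacing that kind of inconsistency is to flag survey questions whose wording differs but whose meaning is probably the same, so their answers can be reviewed together. The sketch below uses TF-IDF cosine similarity via scikit-learn (embedding similarity works the same way); the questions and threshold are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def flag_similar_questions(questions: list[str], threshold: float = 0.6) -> list[tuple]:
    """Flag pairs of questions that are worded differently but likely ask the same thing."""
    matrix = TfidfVectorizer(stop_words="english").fit_transform(questions)
    sims = cosine_similarity(matrix)
    pairs = []
    for i in range(len(questions)):
        for j in range(i + 1, len(questions)):
            if sims[i, j] >= threshold:
                pairs.append((questions[i], questions[j], round(float(sims[i, j]), 2)))
    return pairs

print(flag_similar_questions([
    "How satisfied are you with your manager?",
    "How satisfied are you with your line manager?",
    "Which benefits matter most to you?",
]))
```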

We've seen that, without a complete picture of how these inconsistencies can ripple through and affect the quality of downstream processes and insights, it's a challenge for organisations to anticipate the impact. This is why we're increasingly focused on developing comprehensive frameworks and partnering with our clients to navigate the different stages of data quality maturity.

For a recent healthcare project, we had to ensure all HIPAA and GDPR compliance was in place for data both within and outside their system, and that historical data being imported had been compliance-checked. This project also included several fact-checking guardrails on the output from the models to ensure compliance.

With a current intelligence client, where we’ve developed an AI system to record, translate, and transcribe sessions in near real-time, it's more of an objective process that gets refined over time. The accuracy of transcriptions can only really be validated by the translators and editors using the system. It's difficult to use another transcription system as a check because they're all at similar maturity levels, and none are 100% accurate.
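
Where editors do correct transcripts, those corrections can double as a reference for a word error rate check over time. The snippet below is a generic WER calculation, shown as a sketch rather than the client's actual validation pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + insertions + deletions) / reference length, via word-level edit distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between the first i reference words and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Editor-corrected text as the reference, raw transcription output as the hypothesis
print(round(word_error_rate(
    "the committee approved the draft resolution",
    "the committee improved the draft resolution",
), 3))  # one substitution over six words -> ~0.167
```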

Once you're in the application stage, data quality is handled by a combination of guardrails. Human input is checked for bias, personal information, inappropriate language, etc., and is either blocked or anonymised before it's passed to the models. Similarly, output from the LLM is passed through those same guardrails, and inappropriate content is flagged. Your system can then choose how to handle it - display it back to the user with a warning, or prompt the model again to perform the same task but remove the problematic elements. It depends on what the application is.
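
A minimal sketch of what an input guardrail of this kind might look like. The PII patterns and blocklist below are placeholders; a production system would typically call dedicated PII-detection and moderation services instead.

```python
import re

# Illustrative patterns only - production systems use dedicated PII/toxicity services.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b(?:\+?\d[\s-]?){9,14}\d\b"),
}
BLOCKED_TERMS = {"badword1", "badword2"}  # placeholder blocklist

def apply_input_guardrails(text: str) -> dict:
    """Anonymise PII and flag blocked terms before the text is passed to a model."""
    flags = []
    cleaned = text
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(cleaned):
            flags.append(f"pii:{label}")
            cleaned = pattern.sub(f"[{label.upper()} REMOVED]", cleaned)
    if any(term in cleaned.lower() for term in BLOCKED_TERMS):
        return {"action": "block", "flags": flags + ["inappropriate_language"], "text": None}
    return {"action": "allow", "flags": flags, "text": cleaned}

print(apply_input_guardrails("Contact me on jane.doe@example.com about the claim."))
```

The same function can be run over model output before it's displayed, with the application deciding whether to warn, block, or re-prompt.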

The final stage is evaluation, where observability in your application records all input and output data and evaluates it against defined metrics - for example, that a given function or tool call should produce results within a specific statistical range, or with no personal information present. You're constantly monitoring these metrics to ensure the algorithms perform as expected, and these KPIs serve as a warning system when code changes are introduced.
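
As a sketch of that observability loop, the example below logs each call and checks a couple of illustrative KPIs (p95 latency and a PII-leak rate) against thresholds. The metrics, thresholds, and record fields are assumptions for illustration, not the actual ones used.

```python
from dataclasses import dataclass, field

@dataclass
class CallRecord:
    prompt: str
    response: str
    latency_ms: float
    contains_pii: bool

@dataclass
class Evaluator:
    """Collects every call and checks running metrics against expected ranges."""
    records: list[CallRecord] = field(default_factory=list)
    max_p95_latency_ms: float = 2000.0   # example KPI thresholds
    max_pii_rate: float = 0.0

    def log(self, record: CallRecord) -> None:
        self.records.append(record)

    def check(self) -> list[str]:
        warnings = []
        latencies = sorted(r.latency_ms for r in self.records)
        p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
        if p95 > self.max_p95_latency_ms:
            warnings.append(f"p95 latency {p95:.0f}ms exceeds threshold")
        pii_rate = sum(r.contains_pii for r in self.records) / len(self.records)
        if pii_rate > self.max_pii_rate:
            warnings.append(f"PII present in {pii_rate:.1%} of responses")
        return warnings

evaluator = Evaluator()
evaluator.log(CallRecord("summarise this claim", "...summary...", 1250.0, contains_pii=False))
evaluator.log(CallRecord("summarise this claim", "...summary...", 2600.0, contains_pii=True))
print(evaluator.check())
```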

Most systems we implement include feedback mechanisms in both admin areas and a customer-facing UI where users can report inappropriate results or content. That data is collected and used to fine-tune the model and application. Human evaluation can be subjective, or statistical tools like the Likert scale can be used to make it more academically robust. For regulated industries like insurance and legal, you need that more rigorous approach.
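
A small illustration of turning Likert-scale feedback into something actionable: average the ratings per item and flag anything below a threshold for review and fine-tuning. The item names and threshold are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def summarise_likert_feedback(ratings: list[tuple[str, int]], flag_below: float = 3.0) -> dict:
    """Aggregate 1-5 Likert ratings per item and flag low averages for review."""
    by_item = defaultdict(list)
    for item_id, score in ratings:
        by_item[item_id].append(score)
    averages = {item: round(mean(scores), 2) for item, scores in by_item.items()}
    flagged = [item for item, avg in averages.items() if avg < flag_below]
    return {"averages": averages, "flagged_for_review": flagged}

print(summarise_likert_feedback([
    ("summary_v2", 4), ("summary_v2", 5),
    ("policy_answer_v1", 2), ("policy_answer_v1", 3),
]))
```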

The more users your system has, the better the overall fairness distribution you achieve across your reinforcement learning from human feedback. With only a couple of people doing the evaluation, you're subject to their biases. It works better when the system is evaluated and used by a larger number of people, because we are all inherently biased in one way or another.

Dave: That's a really comprehensive picture of how guardrails and evaluation systems work across different sectors. You mentioned client maturity levels regarding data governance, and that your approach varies between longstanding clients, where you've had years to develop processes, and new clients, where you need to move quickly. What do you think sets our expertise apart when tackling these data quality challenges, especially when onboarding new clients who might not have the same data governance awareness?

Justin: We've been lucky to have had quite a lot of control over our clients' data to date. With one of our longstanding research clients, we've had years of development time where we've been putting processes in place to clean up and validate the data outside of its use within AI. We've organically developed a layer of governance in that project, in collaboration with the client who's responsible for downstream processes when gathering data in the first place.

We need to move ever more quickly on new projects where we haven't had those two or three years of embedded time to carry out these processes naturally. As a result, we're building a more robust onboarding approach. This means clearly outlining the steps we'll go through, what clients need to provide, and the value and benefit of making this part of the project process. That's how we're approaching it now with new clients.

What makes our solutions particularly effective is putting data quality at the heart of what we're doing. It's basically a case of ‘garbage in, garbage out’ - even more so with language models. If you're building observability and evaluation into what you're doing from the start and have a clear way of measuring data quality and improvements over time, that's the secret sauce to getting it right. It's just about being aware of it and having a robust framework for dealing with it.