
Is Your Data Leaking via ChatGPT?

Published 06/07/2023

Originally written and published by Code42.

In November 2022, OpenAI released ChatGPT, a generative artificial intelligence (GAI) tool, which has since taken the world by storm. Only two months after its launch, it had over 100 million users, making it “the fastest-growing consumer application in history.” And ChatGPT is not alone.

While any new technology presents risks, it doesn’t mean the risks are all brand new. In fact, companies might find that they already have many of the people, processes, and technology in place to mitigate the risks of GAI tools but need to fine-tune them to include this new technology category.

That being said, companies should not assume they are already protected from these risks; instead, they should be intentional in their risk mitigation.

How might GAI put sensitive information at risk?

Most GAI tools are designed to take a “prompt” from the user (typed or pasted text, a text file, an image, live or recorded audio, a video, or some combination of these) and then perform some action based on it. ChatGPT, for example, can summarize the content of a prompt (“summarize these meeting notes…”) and debug source code (“identify what’s wrong with this code…”).

The risk is that these prompts can be reviewed by the GAI tool’s developers and might be used to train a future version of the tool. In the case of public tools, the prompts become part of the language model itself, feeding it content and data it would not otherwise have. If a prompt contains only information that is already public, there may be little risk. But if, in the example above, the meeting notes included sensitive information or the source code was proprietary, that confidential data has now left the control of the security team, potentially putting compliance obligations and intellectual property protections at risk.
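
To make the exposure concrete, here is a minimal sketch (in Python) of the kind of pre-submission redaction a security team might apply before a prompt ever reaches an external GAI tool. The patterns and the redact_prompt helper are hypothetical examples for illustration, not a feature of any particular product.

```python
import re

# Illustrative patterns only -- a real deployment would use the
# organization's own DLP rules or secret scanners.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "api_key": re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_prompt(prompt: str) -> str:
    """Mask known-sensitive strings before a prompt leaves the company."""
    for label, pattern in SENSITIVE_PATTERNS.items():
        prompt = pattern.sub(f"[REDACTED {label.upper()}]", prompt)
    return prompt

print(redact_prompt(
    "Summarize these meeting notes: contact jane@example.com, "
    "staging key sk-abcdef1234567890"
))
```

Redaction of this kind only reduces, rather than eliminates, the exposure: anything left in the prompt still leaves the organization’s control.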

In the most extreme situation, someone’s input might become someone else’s output.

And that’s not only theoretical.

In January 2023, Amazon issued an internal communication restricting ChatGPT use after it discovered ChatGPT results “mimicked internal Amazon data.” And while the initial dataset used to train ChatGPT stopped being collected in 2021, ChatGPT isn’t done training. In fact, in the future, when an engineer asks for help optimizing source code, the suggestions that come back might look just like the proprietary code from Samsung’s semiconductor division.

With one report indicating two-thirds of jobs could be at least partially automated by GAI tools, what other sensitive information might end up being used as training data?

It’s all in the balance

An intentional, balanced approach across people, process, and technology is the only way to truly mitigate the risk of data leaking via a GAI tool.

Assemble your stakeholders

Start by bringing your core stakeholders together (Human Resources, Legal, IT, and Security) to define the policies and procedures for GAI tool use, in alignment with executive guidance:

  • Who can use GAI tools, and when?
  • What can be used as input?
  • How can the output be used?

Some companies, like Amazon, choose to allow GAI tool use as long as their employees are careful; others, like Samsung, have decided to ban its use altogether.

Communicate and educate

  • Provide explicit guidance in your Acceptable Use Policy
  • Require proactive training on the risks associated with GAI tool use
  • Let employees know where they can go for assistance with proper use and how to report improper use

Stop, contain, and correct

  • Monitor employee data movement to untrusted GAI tools
  • Block paste activity into unapproved GAI tools
  • Generate alerts for file uploads to GAI tools (see the sketch after this list)
  • Respond with real-time training for employees who make a mistake
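
As a rough illustration of the monitoring and alerting items above, here is a minimal Python sketch that scans a web-proxy log for traffic to GAI domains and flags unapproved tools and likely file uploads. The domain lists, log schema, and byte threshold are all assumptions for this example; in practice these controls would live in an endpoint agent, proxy, or Insider Risk Management product.

```python
import csv

# Assumed domain lists and log schema -- adjust to your environment.
GAI_DOMAINS = {"chat.openai.com", "chatgpt.com", "bard.google.com"}
APPROVED = {"chat.openai.com"}   # hypothetical: an approved enterprise tenant
UPLOAD_ALERT_BYTES = 100_000     # flag anything resembling a file upload

def scan_proxy_log(path: str):
    """Yield (user, domain, bytes_sent, reason) alerts from a CSV proxy log
    with columns: user, domain, method, bytes_sent."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            domain = row["domain"]
            if domain not in GAI_DOMAINS:
                continue
            sent = int(row["bytes_sent"])
            if domain not in APPROVED:
                yield (row["user"], domain, sent, "unapproved GAI tool")
            elif row["method"] == "POST" and sent > UPLOAD_ALERT_BYTES:
                yield (row["user"], domain, sent, "possible file upload")

if __name__ == "__main__":
    for alert in scan_proxy_log("proxy_log.csv"):
        print("ALERT:", alert)
```

Alerts like these are most effective when paired with the real-time training mentioned above, so an employee learns why the activity was flagged at the moment it happens.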

Many of the risks of GAI tools aren’t new, and with an intentional, holistic approach, organizations can mitigate GAI tool risk without losing the momentum of possibility.


About the Author

Code42 is the leader in Insider Risk Management (IRM), offering end-to-end data loss detection and response solutions.
