Understand Code Integrity Risks for LLM-Produced Content

July 6, 2023 | By Aaron Turner, IANS Faculty

The copyright and intellectual property risks that enterprises inherit by using code engines powered by large language models (LLMs) are currently tremendous, and organizations should avoid allowing developers to use LLMs for core technologies. We suggest giving all LLM users strict guidance about the confidentiality risks associated with LLM use, and creating policies, processes and controls for sensitive data.

This piece explains how the terms of service of the major players in the LLM ecosystem, including ChatGPT, provide neither adequate protections for LLM content nor provable, auditable data segmentation controls, and how their intellectual property policies favor the LLM platform operator over the end user.

LLM Risk: Data Integrity and Confidentiality

ChatGPT and other LLMs have disrupted nearly every corner of the technology ecosystem. From an information security perspective, there are important lessons to be learned about the integrity of the data that enterprises use as a result of LLM output. Equally important are considerations around the confidentiality of input into LLMs. At present, LLM terms of service, content segmentation controls and intellectual property protections are not mature enough for enterprises to integrate LLMs into daily technology operations.

LLM Content Policies: Code Integrity 

Within the terms of service of LLMs, Input (what the user prompts the system to do) and Output (what the LLM returns to the user) are generally referred to together as “Content.” In the OpenAI terms of use published as of this writing, OpenAI assigns to users all of its rights, title and interest in and to Output. As for Input, ChatGPT and DALL-E users agree to let OpenAI use their Input to improve OpenAI’s models. At present, OpenAI states that Input submitted through its API is not used to improve OpenAI’s models.

For Microsoft services built on OpenAI models, such as GitHub Copilot, Content ownership gets a bit murkier. For example, in the Free, Pro & Team versions of GitHub Copilot, the output is called “Suggestions,” and the terms state, “GitHub does not claim any rights in Suggestions, and you retain ownership of and responsibility for Your Code, including Suggestions you include in Your Code.” The sticking point is the way Copilot collects input: Microsoft reserves the right to collect snippets of ‘Your Code’ and will collect additional usage information through the integrated development environment or editor tied to your account.

Taken together, these two provisions give the user a fairly clean license to use any code ‘Suggestions’, but Microsoft also reserves the right to share the code it gathers with others.

This should be a significant concern for an organization’s general counsel. Suppose an organization is working on a logistics improvement application, and Microsoft discovers certain aspects of its logistics optimization through Copilot. It would be within Microsoft’s rights to commercialize that logistics optimization technology, either to others or through Microsoft’s own platforms, like Dynamics.

Developers Using LLMs for Core Technologies 

Given the intellectual property terms that Microsoft and OpenAI apply to Content, Suggestions, etc., the AI risks that enterprises inherit by using LLM-powered code engines are currently tremendous. Eventually, organizations could start to see terms and conditions from other organizations requiring disclosure of the use of LLMs in the development of all technologies. Organizations that rely on LLMs to produce code could inherit all of the intellectual property risks associated with the use of Content/Suggestions from those systems.

To clarify, any code that a developer inputs into an LLM should be considered publicly shared. Likewise, any code provided as Output or Suggestions should be considered tainted, because neither OpenAI nor Microsoft makes any representation that the code can be used without infringing on the rights of anyone who previously copyrighted or otherwise protected that code.

Confidentiality of LLM Input 

Organizations should have strict guidance for all LLM users about the confidentiality risks associated with the use of LLMs. OpenAI was recently forced to disclose a bug that allowed users to gain access to the prompts of other users, including very large training data sets that had to be sent to ChatGPT as prompts. At present, there are no provable data protection models for any data that is fed into LLMs as prompts or training data. This includes the newly announced Azure OpenAI Service, whose terms-of-use documentation is complicated by circular references among different Azure Product Licensing and Subscription Agreements.

Create LLM Policies, Processes and Controls for Sensitive Data 

Unfortunately, there are very few automated controls that allow for the complete and comprehensive blocking of OpenAI services from enterprise networks. It will become increasingly important for technology teams to create clear policies, educate users about appropriate processes and implement controls where possible.
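
As one illustration of a partial control, the sketch below shows the kind of hostname check an egress proxy or DNS filter might apply. It is a minimal example under stated assumptions, not a complete control: the blocklist entries and the `is_blocked` helper are hypothetical, and `llm-vendor.example` is a placeholder domain. A list like this cannot by itself cover LLM features embedded in other products, which is consistent with the limitation described above.

```python
# Hypothetical blocklist for illustration only; a real deployment would need
# a maintained inventory of LLM-related domains, which this is not.
BLOCKED_LLM_DOMAINS = {"openai.com", "llm-vendor.example"}


def is_blocked(hostname: str, blocked=BLOCKED_LLM_DOMAINS) -> bool:
    """Return True if the hostname is a blocked domain or one of its subdomains."""
    host = hostname.strip().rstrip(".").lower()
    labels = host.split(".")
    # Test the full hostname and every parent domain, so "api.openai.com"
    # matches the "openai.com" entry but "notopenai.com" does not.
    return any(".".join(labels[i:]) in blocked for i in range(len(labels)))


if __name__ == "__main__":
    for name in ("api.openai.com", "notopenai.com", "example.org"):
        print(name, "blocked" if is_blocked(name) else "allowed")
```

Matching on whole domain labels rather than substrings avoids blocking unrelated sites whose names merely contain a blocked string.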

Using LLMs to Improve Security Policies

LLMs are incredible technology platforms that have the potential to provide real performance and efficiency benefits to technology teams ranging from developers to customer support. The danger in their current iterations lies in the fact that users have little control over the use of any data that is input into the systems, and that any output could put code bases and platforms at risk of intellectual property rights disputes.

One of the few bright spots for security teams is the use of LLMs to improve security policies. For example, if a security team inputs an anonymized incident response policy into an LLM and then asks it for suggestions on how to improve efficiency or flexibility, the LLM will likely make good suggestions. The lack of confidentiality for input into the system will, however, create significant overhead for organizations, which must train users on how to appropriately anonymize input into LLMs.
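
To make the anonymization step concrete, here is a minimal Python sketch of scrubbing a policy excerpt before it is pasted into an LLM. Everything in it is an assumption for illustration: the `anonymize` helper, the two regular expressions and the `org_terms` parameter are hypothetical, and pattern matching of this kind will miss many identifiers (people's names, project code names, internal URLs). It can support user training and manual review, not replace them.

```python
import re

# Illustrative patterns only; they are not an exhaustive list of identifiers.
PATTERNS = [
    ("EMAIL", re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")),
    ("IPV4", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
]


def anonymize(text, org_terms=()):
    """Replace e-mail addresses, IPv4 addresses and caller-supplied
    organization terms with placeholders.

    Returns the scrubbed text and a count of replacements per category,
    so a reviewer can see what was changed before anything is submitted.
    """
    counts = {}
    for label, pattern in PATTERNS:
        text, n = pattern.subn(f"[{label}]", text)
        counts[label] = n
    org_count = 0
    for term in org_terms:
        # Escape the term so it is matched literally, ignoring case.
        text, n = re.subn(re.escape(term), "[ORG]", text, flags=re.IGNORECASE)
        org_count += n
    counts["ORG"] = org_count
    return text, counts


if __name__ == "__main__":
    policy = "Escalate to soc@acme-corp.com; Acme Corp analysts isolate 10.20.30.40."
    scrubbed, counts = anonymize(policy, org_terms=["Acme Corp"])
    print(scrubbed)
    print(counts)
```

Returning the replacement counts alongside the text gives the user a quick sanity check: a count of zero on a document known to contain company names suggests the term list needs updating.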

Although reasonable efforts will be made to ensure the completeness and accuracy of the information contained in our blog posts, no liability can be accepted by IANS or our Faculty members for the results of any actions taken by individuals or firms in connection with such information, opinions, or advice.
