Sentiment analysis in your own PDF documents with ChatGPT
While preparing some documents for a forthcoming academic review, I was asked by my department chair to go through all of the student comments from every course I've taught in the last 6 years and find a few "positive" comments that he could quote in his summary writeup.
I usually teach 3 courses per year, sometimes with cross-listings in multiple departments and/or undergrad/grad sections. All in, this resulted in 32 PDF documents, each with many student comments that I needed to draw from. Doing it manually would require opening each document, reading and/or compiling the comments, and then evaluating and choosing the "most positive" ones. I decided to use ChatGPT with a text embedding to assist me in this task, and thought the code might be useful to others, so I am sharing it below along with some comments and documentation.
First we start with the package imports. I'm relying heavily on langchain, which provides abstractions for using large language models (LLMs) and tooling to easily "chain together" tasks such as reading in text and creating a vectorstore of the text embedding, which is then passed to an LLM along with a prompt that contains a specific question or instruction.
import os
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.prompts.chat import SystemMessage, HumanMessagePromptTemplate
from IPython.display import display, Markdown
We're going to use ChatGPT from OpenAI, so we'll need to supply an API key. This video tutorial demonstrates how to acquire an API key. If you'd like to use the code below, you'll need to uncomment the line and paste your API key into the string to the right of the equals sign.
#os.environ['OPENAI_API_KEY'] = "<your API key here>"
Now we'll define a few helper functions to find PDFs in a given directory and then parse the text out of them. The get_pdf_text function below combines the text from all the PDFs into a single string.
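def find_pdf_files(directory_path):
    pdf_files = []
    # Walk the directory tree and collect the full path of every PDF found.
    for root, dirs, files in os.walk(directory_path):
        for file in files:
            if file.endswith('.pdf'):
                pdf_files.append(os.path.join(root, file))
    return pdf_files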
def get_pdf_text(pdf_files):
    text = ""
    for pdf in pdf_files:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            text += page.extract_text()
    return text
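One detail worth noting about the file search: a test on the '.pdf' suffix is case-sensitive, so a file named "Evals.PDF" would be skipped. A minimal sketch of a case-insensitive variant using pathlib is shown here; the function name find_pdf_files_sorted is hypothetical and not part of the original code, and sorting is added only so the files are read in a predictable order.

```python
from pathlib import Path


def find_pdf_files_sorted(directory_path):
    """Return paths of all PDFs under directory_path, matching .pdf in any case."""
    root = Path(directory_path)
    return sorted(
        str(path)
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

The returned list of strings can be passed to get_pdf_text in the same way as the output of find_pdf_files.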
Next we'll use the CharacterTextSplitter class from langchain to take the continuous string of text from all the PDFs and split it into chunks, which are needed to create a vectorstore text embedding. Text embeddings measure the relatedness of text strings.
def get_text_chunks(raw_text):
    # Split on newlines into chunks of roughly 2000 characters that overlap by 500.
    text_splitter = CharacterTextSplitter(
        separator = '\n',
        chunk_size = 2000,
        chunk_overlap = 500,
        length_function = len
    )
    chunks = text_splitter.split_text(raw_text)
    return chunks
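To get a feel for what chunk_size and chunk_overlap do, here is a simplified sketch of overlapping windows over a string. This is not the algorithm CharacterTextSplitter uses (that class splits on the separator first and then merges the pieces); the function sliding_chunks is a hypothetical stand-in that only illustrates how neighboring chunks share text, so that a comment straddling a boundary still appears whole in at least one chunk.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


demo = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With the values used in this post (2000 and 500), each chunk would share about a quarter of its text with the next one.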
Here we create the vectorstore using the OpenAIEmbeddings class. While we are using an OpenAI embedding here, it's not required. Langchain provides nice abstractions that allow for using different embedding models with ChatGPT.
def get_vectorstore(chunks):
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_texts(texts=chunks, embedding=embeddings)
    return vectorstore
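As a rough illustration of what "relatedness" means here, the sketch below ranks a few made-up vectors by cosine similarity to a query vector. The three-dimensional vectors and the helper names cosine_similarity and top_k are assumptions for illustration only; real embeddings have many more dimensions, and the FAISS index may use a different distance measure internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def top_k(query_vector, chunk_vectors, k=2):
    """Return the indices of the k chunk vectors most similar to the query."""
    ranked = sorted(
        range(len(chunk_vectors)),
        key=lambda i: cosine_similarity(query_vector, chunk_vectors[i]),
        reverse=True,
    )
    return ranked[:k]


# Made-up vectors standing in for embedded text chunks.
toy_chunks = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
best = top_k([1.0, 0.0, 0.0], toy_chunks, k=2)
```

The vectorstore plays the role of top_k at query time: it embeds the question and hands the closest chunks to the LLM.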
Below we create the conversation chain, which combines a prompt template with the text embedding and accepts user queries. Prompt templates can increase the accuracy of LLM responses. Special thanks to Jeremy Howard and his Twitter thread, from which I took his custom instructions as a system prompt template.
def get_conversation_chain(vectorstore):
    # System prompt (custom instructions) followed by a slot for the user's question.
    template = ChatPromptTemplate.from_messages(
        [
            SystemMessage(
                content=(
                    "You are an autoregressive language model that has been fine-"
                    "tuned with instruction-tuning and RLHF. You carefully "
                    "provide accurate, factual, thoughtful, nuanced answers, and "
                    "are brilliant at reasoning. If you think there might not be "
                    "a correct answer, you say so. Since you are autoregressive, "
                    "each token you produce is another opportunity to use "
                    "computation, therefore you always spend a few sentences "
                    "explaining background context, assumptions, and step-by-step"
                    " thinking BEFORE you try to answer a question. Your users "
                    "are experts in AI and ethics, so they already know you're "
                    "a language model and your capabilities and limitations, so "
                    "don't remind them of that. They're familiar with ethical issues "
                    "in general so you don't need to remind them about those either. "
                    "Don't be verbose in your answers, but do provide details and "
                    "examples where it might help the explanation."
                )
            ),
            HumanMessagePromptTemplate.from_template("{question}"),
        ]
    )
    llm = ChatOpenAI()  # optionally: ChatOpenAI(model_name="gpt-4")
    memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)
    # Assumption: the template is supplied as the condense-question prompt.
    conversation_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(),
        condense_question_prompt=template,
        memory=memory,
    )
    return conversation_chain
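Conceptually, a conversational retrieval step does three things: it folds any earlier turns into a standalone question, fetches the chunks most related to that question, and asks the LLM to answer using those chunks as context. The sketch below mimics that flow with plain callables. It is a toy under stated assumptions, not how langchain implements it: answer_with_retrieval, retrieve, and llm are hypothetical names, and the prompt wording is invented.

```python
def answer_with_retrieval(question, chat_history, retrieve, llm):
    """Toy version of one conversational retrieval step.

    retrieve: callable taking a question and returning a list of text chunks.
    llm: callable taking a prompt string and returning a reply string.
    Both are stand-ins supplied by the caller, not langchain objects.
    """
    standalone = question
    if chat_history:
        # Fold earlier turns into the follow-up so retrieval sees a complete question.
        history_text = "\n".join(f"Q: {q}\nA: {a}" for q, a in chat_history)
        standalone = llm(
            "Rewrite the follow-up as a standalone question.\n"
            f"{history_text}\nFollow-up: {question}"
        )
    # Join the retrieved chunks into a single context block for the final prompt.
    context = "\n---\n".join(retrieve(standalone))
    answer = llm(f"Use this context to answer.\n{context}\nQuestion: {standalone}")
    chat_history.append((question, answer))  # remember the turn for next time
    return answer
```

Passing simple stub functions for retrieve and llm makes it possible to step through the flow without calling any API.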
Finally, the function below combines all the helper functions and returns a chat chain that we can ask questions based on the content contained in our PDFs.
def create_chat_from_pdfs(directory):
    pdfs = find_pdf_files(directory)
    raw_text = get_pdf_text(pdfs)
    chunks = get_text_chunks(raw_text)
    vs = get_vectorstore(chunks)
    chain = get_conversation_chain(vs)
    return chain
Here we initialize the chain to read in any PDFs in the current working directory.
chat_chain = create_chat_from_pdfs(".")
Now we use the chain and our custom text embedding to find 15 of the "most positive and supportive" student comments, giving preference to those that use the phrase "Dr. Foster". Finally, we print out the results of the query using Markdown to format the list in rich text.
result = chat_chain.run(
    "In the given text, there are many student comments "
    "preceded with the word RESPONSE in capital letters. "
    "Choose 15 of the most positive and supportive "
    "student comments for the instructor and course. "
    "Give preference to the comments that call out "
    "Dr. Foster by name"
)
# Render the answer as rich text in the notebook.
display(Markdown(result))
While you don't have access to the original PDFs, I can assure you that these comments have been parsed from them and are quite responsive to the prompt I used. They are not all "perfectly" responsive, but I really only need 2-3 comments, so I asked for 15 so that I could inspect them, downselect, and choose the ones I'd like to provide.
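Since the comments will be quoted, it can be worth checking that each one really appears in the source text and was not paraphrased by the model. A minimal sketch of such a check follows, assuming the raw text is the string produced by get_pdf_text and the candidate comments have been copied into a list by hand. The helper names normalize and split_verified are hypothetical.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so PDF line breaks do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def split_verified(candidates, raw_text):
    """Separate comments found verbatim in raw_text from those that are not."""
    haystack = normalize(raw_text)
    found, missing = [], []
    for comment in candidates:
        if normalize(comment) in haystack:
            found.append(comment)
        else:
            missing.append(comment)
    return found, missing
```

Any comment that lands in the missing list deserves a manual look in the original PDF before it is quoted.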
You should be able to adapt this code to your own use cases.