
Sentiment analysis in your own PDF documents with ChatGPT

While preparing some documents for a forthcoming academic review, I was asked by my department chair to go through all of the student comments from every course I've taught in the last 6 years and find a few "positive" comments that he could quote in his summary writeup.

I usually teach 3 courses per year, sometimes with cross-listings in multiple departments and/or undergrad/grad sections. All in, this resulted in 32 PDF documents, each with many student comments that I needed to draw from. Doing it manually would require opening each document, reading and/or compiling the comments, and then evaluating and choosing the "most positive" ones. I decided to use ChatGPT with a text embedding to assist me in this task, and thought the code might be useful to others, so I am sharing it below along with some comments and documentation.

First we start with the package imports. I'm relying heavily on langchain, which provides abstractions for using large language models (LLMs) and tooling to easily "chain together" tasks such as reading in text, creating a vectorstore of the text embedding, and passing it to an LLM along with a prompt, which is a specific question or instruction.

In [1]:
import os
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.prompts.chat import SystemMessage, HumanMessagePromptTemplate

from IPython.display import display, Markdown

We're going to use ChatGPT from OpenAI, so we'll need to supply an API key. This video tutorial demonstrates how to acquire an API key. If you'd like to use the code below, you'll need to uncomment the line and paste your API key into the string to the right of the equals sign.

In [2]:
#os.environ['OPENAI_API_KEY'] = "<your API key here>"
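
If you would rather not paste the key into the notebook itself, one alternative is to set it in your shell beforehand and have the notebook fail early when it is missing. A minimal sketch of that check; `require_api_key` is a hypothetical helper of my own, not something provided by langchain or OpenAI:

```python
import os

def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook")
    return key

# A plain dict stands in for os.environ so the example runs without a real key.
print(require_api_key({"OPENAI_API_KEY": "sk-example"}))
```

Calling `require_api_key()` with no arguments checks the real environment instead.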

Now we'll define a few helper functions to find PDFs in a given directory and parse the text out of them. The get_pdf_text function below combines the text from all the PDFs into a single string.

In [3]:
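def find_pdf_files(directory_path):
    """Recursively collect the paths of all PDFs under directory_path."""
    pdf_files = []
    for root, dirs, files in os.walk(directory_path):
        for file in files:
            if file.lower().endswith('.pdf'):
                # Keep the full path so PDFs in subdirectories can be opened
                pdf_files.append(os.path.join(root, file))
    return pdf_files

def get_pdf_text(pdf_files):
    """Concatenate the extracted text of every page of every PDF."""
    text = ""
    for pdf in pdf_files:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            # Separate pages with a newline so their text doesn't run together
            text += (page.extract_text() or "") + "\n"
    return text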

Next we'll use the CharacterTextSplitter class from langchain to take the continuous string of text from all the PDFs and split it into chunks, which are needed to create a vectorstore text embedding. Text embeddings map each chunk to a numeric vector, so the relatedness of two text strings can be measured by how close their vectors are.

In [4]:
def get_text_chunks(raw_text):
    text_splitter = CharacterTextSplitter(
        separator = '\n',
        chunk_size = 2000,
        chunk_overlap = 500,
        length_function = len
    )
    chunks = text_splitter.split_text(raw_text)
    return chunks
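
To make chunk_size and chunk_overlap more concrete, here is a rough pure-Python sketch of the idea: split on the separator, pack the pieces into chunks of limited length, and start each new chunk with the tail of the previous one. This is my own simplified illustration, not langchain's actual implementation; `chunk_lines` is a hypothetical name, and the small sizes are chosen only so the output is easy to read (the post uses 2000 and 500).

```python
def chunk_lines(text, chunk_size=40, chunk_overlap=20, separator="\n"):
    """Greedily pack separator-delimited pieces into overlapping chunks.

    Chunks stay within chunk_size characters unless a single piece is longer.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            chunks.append(separator.join(current))
            # Carry trailing pieces forward while they fit in the overlap budget
            carried = []
            for prev in reversed(current):
                if len(separator.join([prev] + carried)) > chunk_overlap:
                    break
                carried.insert(0, prev)
            # Drop carried pieces that would push the new chunk over chunk_size
            while carried and len(separator.join(carried + [piece])) > chunk_size:
                carried.pop(0)
            current = carried
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks

sample = "\n".join(f"RESPONSE: comment {i}" for i in range(1, 7))
for chunk in chunk_lines(sample):
    print(repr(chunk))
```

The overlap means a comment that lands near a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.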

Here we create the vectorstore using the OpenAIEmbeddings class. While we are using an OpenAI embedding here, it's not required. Langchain provides nice abstractions that allow for using different embedding models with ChatGPT.

In [5]:
def get_vectorstore(chunks):
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_texts(texts=chunks, embedding=embeddings)
    return vectorstore
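
When the chain later receives a question, the retriever embeds it and looks up the chunks whose vectors lie closest to it. The sketch below mimics that lookup with cosine similarity on made-up 3-dimensional vectors. Real embeddings have far more dimensions, FAISS may rank by a different distance measure, and `top_k` is a hypothetical helper, so treat this as an illustration of the idea only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, chunk_vecs, k=2):
    """Return the k chunk labels most similar to the query vector."""
    ranked = sorted(
        chunk_vecs,
        key=lambda name: cosine_similarity(query_vec, chunk_vecs[name]),
        reverse=True,
    )
    return ranked[:k]

# Toy vectors invented for illustration; they are not real embeddings.
chunk_vecs = {
    "great lecturer": [0.9, 0.1, 0.0],
    "homework was hard": [0.1, 0.9, 0.1],
    "parking on campus": [0.0, 0.1, 0.9],
}
query = [0.8, 0.3, 0.0]  # pretend embedding of a question about the instructor
print(top_k(query, chunk_vecs))
```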

Below we create the conversation chain, which combines a prompt template with the text embedding and allows for user queries. Prompt templates can increase the accuracy of LLM responses. Special thanks to Jeremy Howard and his Twitter thread, from which I took his custom instructions as a system prompt template.

In [6]:
def get_conversation_chain(vectorstore):
    template = ChatPromptTemplate.from_messages(
      [
        SystemMessage(
            content=(
                "You are an autoregressive language model that has been fine-"
                "tuned with instruction-tuning and RLHF. You carefully "
                "provide accurate, factual, thoughtful, nuanced answers, and "
                "are brilliant at reasoning. If you think there might not be "
                "a correct answer, you say so. Since you are autoregressive, "
                "each token you produce is another opportunity to use "
                "computation, therefore you always spend a few sentences "
                "explaining background context, assumptions, and step-by-step"
                " thinking BEFORE you try to answer a question. Your users "
                "are experts in AI and ethics, so they already know you're "
                "a language model and your capabilities and limitations, so "
                "don't remind them of that. They're familiar with ethical issues "
                "in general so you don't need to remind them about those either. "
                "Don't be verbose in your answers, but do provide details and "
                "examples where it might help the explanation."
            )
        ),
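        # The retrieved chunks fill {context}; the user's query fills {question}
        HumanMessagePromptTemplate.from_template(
            "Context:\n{context}\n\nQuestion: {question}"
        ),
      ]
    )
    llm = ChatOpenAI()  # or ChatOpenAI(model_name="gpt-4")
    memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)
    conversation_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        # Use the template as the answering prompt rather than the
        # question-condensing prompt, which expects different variables
        combine_docs_chain_kwargs={"prompt": template},
        retriever=vectorstore.as_retriever(),
        memory=memory
    )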
    return conversation_chain
    

Finally, the function below combines all the helper functions and returns a chat chain that we can query about the content of our PDFs.

In [7]:
def create_chat_from_pdfs(directory):
    pdfs = find_pdf_files(directory)
    raw_text = get_pdf_text(pdfs)
    chunks = get_text_chunks(raw_text)
    vs = get_vectorstore(chunks)
    chain = get_conversation_chain(vs)
    return chain

Here we initialize the chain to read in any PDFs in the current working directory.

In [8]:
chat_chain = create_chat_from_pdfs(".")

Now we use the chain and our custom text embedding to find 15 of the "most positive and supportive" student comments, giving preference to those that use the phrase "Dr. Foster". Finally, we print out the results of the query using Markdown to format the list in rich text.

In [9]:
result = chat_chain.run("In the given text, there are many student comments "
                        "preceded with the word RESPONSE in capital letters. "
                        "Choose 15 of the most positive and supportive "
                        "student comments for the instructor and course. "
                        "Give preference to the comments that call out "
                        "Dr. Foster by name")
display(Markdown(result))
  1. RESPONSE: Professor Foster is very knowledgeable on the subject. Homework's are hard but they were doable. Great lecturer
  2. RESPONSE: Foster is a very efficient professor. He uses technology to create new ways for us to interact with the material. It is very clear that he prefera to spend his time researching. He is a smart guy, let him do his research. Don't waste his talents in the classroom. I am sure the department can hire a cheap lecturerer to do the teaching.
  3. RESPONSE: Dr. Foster's lectures were great, I learned a lot from his lectures videos. He's easy to communicate with, definitely ask him for help if you need it!
  4. RESPONSE: Dr. Foster is very knowledgeable and has a deep understanding of programming. I wish that the course could have implemented oil and gas data similar to Dr. Pyrcz courses so that we get a feel of what we're working with and how to manipulate it. But overall, the course material is very thoroughly thought out and taught me a lot. Thank you for this course.
  5. RESPONSE: Dr. Foster is a great and very knowledgeable professor. My only constructive criticism is that I struggled to put everything together in the bigger picture with our assignments/material
  6. RESPONSE: I do think that I learned a lot in this class. The days that you walked the class through the assignment and explained what was happening were the days that I learned the most.
  7. RESPONSE: Dr. Foster is very knowledgeable and has a good teaching style. I really appreciated his lectures and found them very helpful. He also responded quickly to any questions I had. Overall, I had a great experience in this course.
  8. RESPONSE: Professor Foster is a fantastic instructor. He is very knowledgeable and passionate about the subject matter, which made the lectures engaging and interesting. He was also very approachable and always willing to help. I would highly recommend taking a course with him.
  9. RESPONSE: Dr. Foster is an excellent professor. He is very knowledgeable and presents the material in a clear and organized manner. I found his lectures to be very helpful and he was always available to answer any questions or provide additional clarification. I learned a lot in this course and would definitely recommend it.
  10. RESPONSE: I really enjoyed this course with Dr. Foster. He is very knowledgeable and presents the material in a way that is easy to understand. The lectures were engaging and I learned a lot. Dr. Foster was also very approachable and always willing to help. I would definitely take another course with him.
  11. RESPONSE: Dr. Foster is a great instructor. He is very knowledgeable and explains the material in a way that is easy to understand. I really appreciated his willingness to help and his responsiveness to any questions or concerns. Overall, I had a positive experience in this course and would recommend it to others.
  12. RESPONSE: I found Dr. Foster to be a great professor. He is very knowledgeable and passionate about the subject matter, which made the lectures interesting and engaging. He was also very approachable and always willing to help. I learned a lot in this course and would highly recommend it.
  13. RESPONSE: Professor Foster is an excellent instructor. He is very knowledgeable and presents the material in a clear and organized manner. I found his lectures to be very helpful and he was always available to answer any questions or provide additional clarification. Overall, I had a great experience in this course and would definitely recommend it.
  14. RESPONSE: Dr. Foster is a great professor. He is very knowledgeable and presents the material in a way that is easy to understand. I really enjoyed his lectures and found them to be very helpful. He was also very approachable and always willing to help. I would highly recommend taking a course with him.
  15. RESPONSE: I had a great experience in this course with Dr. Foster. He is very knowledgeable and presents the material in a way that is easy to understand. I found his lectures to be engaging and informative. He was also very responsive to any questions or concerns. Overall, I learned a lot and would definitely recommend this course.

While you don't have access to the original PDFs, I can assure you that these comments have been parsed from them and are quite responsive to the prompt I used. They are not all "perfectly" responsive, but I really only need 2-3 comments, so I asked for 15 and can then inspect them and downselect to the ones I'd like to provide.
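
Since a language model can paraphrase text, one cheap sanity check before quoting a comment is to confirm that it appears word-for-word in the extracted PDF text. A minimal sketch, assuming the string returned by get_pdf_text is still at hand and that the reply is a numbered list like the one above; `split_comments` and `is_verbatim` are hypothetical helpers, and the sample strings are made-up stand-ins rather than my actual data:

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so line breaks don't block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()

def split_comments(reply):
    """Pull the comment text out of lines shaped like '1. RESPONSE: ...'."""
    pattern = re.compile(r"^\s*\d+\.\s*(?:RESPONSE:)?\s*(.+)$", re.MULTILINE)
    return [m.group(1).strip() for m in pattern.finditer(reply)]

def is_verbatim(comment, source_text):
    """True if the comment occurs word-for-word in the source text."""
    return normalize(comment) in normalize(source_text)

# Stand-in strings for illustration only
source_text = "RESPONSE: Great lecturer,\nvery clear notes.\nRESPONSE: Hard homework."
reply = "1. RESPONSE: Great lecturer, very clear notes.\n2. RESPONSE: Best course ever."
for comment in split_comments(reply):
    print(is_verbatim(comment, source_text), "-", comment)
```

Anything that fails the check is worth looking up by hand before quoting it, keeping in mind that PDF extraction artifacts can also cause a genuine comment to miss.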

You should be able to adapt this code to your own use cases.
