Text Mine ProQuest with ChatGPT

On June 13, 2025, ProQuest announced that the popular text-mining environment TDM Studio now includes a beta feature that lets users integrate GPT models into their R or Python workbench notebooks.

TDM Studio, available at no cost to researchers at Columbia, opens up the ProQuest databases to large-scale analyses of the full-text corpora. It comes in two flavors, a “no-code” visualization mode and a “workbench” mode that requires some familiarity with Python or R. Many researchers at Columbia have used TDM Studio in the past few years, particularly to analyze ProQuest’s historical newspapers and news & newspapers databases.

While it is possible to upload text analysis models (from Hugging Face, for example) to TDM Studio workbenches, the analysis must then be run locally, using TDM Studio's computational resources. With this new feature, researchers can instead write a prompt and have OpenAI’s servers compute the response. Because the TDM Studio workbench effectively has no connection to the internet, it is only through ProQuest’s feature that researchers can make these API calls to OpenAI’s GPT servers.

Anyone at Columbia can sign up for a TDM Studio account. If you have any questions, we encourage you to reach out to Research Data Services at data@library.columbia.edu for further guidance.

ProQuest has also compiled a video for prospective users of the chat integration. As the video shows, the implementation is straightforward for users familiar with Python: we create a prompt and send it, along with our dataset, to OpenAI, which returns a response. In the video, the model is used for sentiment analysis, but other text-analysis tasks are possible, within the limits of the GPT models themselves.
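To make the prompt-and-response workflow concrete, here is a minimal Python sketch of the sentiment analysis case. It does not reproduce ProQuest's actual interface: `send_prompt` is a hypothetical stand-in for whatever call the TDM Studio notebook provides, and the prompt wording, function names, and label set are our own assumptions.

```python
from typing import Callable, Iterable, List

LABELS = {"positive", "negative", "neutral"}


def build_sentiment_prompt(article_text: str) -> str:
    """Wrap one article in an instruction asking for a one-word sentiment label."""
    return (
        "Classify the sentiment of the following newspaper article as "
        "positive, negative, or neutral. Reply with one word only.\n\n"
        f"Article:\n{article_text}"
    )


def normalize_label(raw_response: str) -> str:
    """Map a free-text model reply onto one of the three labels, else 'unknown'."""
    cleaned = raw_response.strip().lower().rstrip(".")
    return cleaned if cleaned in LABELS else "unknown"


def label_articles(articles: Iterable[str],
                   send_prompt: Callable[[str], str]) -> List[str]:
    """Send one prompt per article and collect the normalized labels.

    `send_prompt` is hypothetical: it takes a prompt string and returns the
    model's reply as text. Swap in the call shown in ProQuest's video.
    """
    return [normalize_label(send_prompt(build_sentiment_prompt(a)))
            for a in articles]


if __name__ == "__main__":
    # Stub model for demonstration only: no network access, no real GPT call.
    def fake_model(prompt: str) -> str:
        return "Negative." if "strike" in prompt else "Positive"

    sample = ["The strike entered its third week.", "Record harvest celebrated."]
    print(label_articles(sample, fake_model))
```

Passing the sending function in as an argument keeps the prompt logic testable outside the workbench, where no GPT connection is available.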

Researchers have access to the models listed in the table below, subject to usage limits.

Use of the GPT LLMs is limited to ten requests per second and 50,000 tokens per minute. Users are also limited to $5 worth of usage per day.
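When looping over a large corpus, it can help to pace requests on the notebook side so that a job stays under the per-second and per-minute limits above. The following is a minimal sketch of one way to do that with a sliding window. The `Throttle` class and its parameter names are our own, not part of TDM Studio; the token count for each request has to be estimated by the caller; and the sketch does not track the $5 daily limit.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class Throttle:
    """Client-side pacing for a requests-per-second and a tokens-per-minute cap."""

    def __init__(self,
                 max_requests_per_second: int = 10,
                 max_tokens_per_minute: int = 50_000,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.max_rps = max_requests_per_second
        self.max_tpm = max_tokens_per_minute
        self.clock = clock
        self.sleep = sleep
        # Chronological record of admitted requests: (timestamp, tokens).
        self.events: Deque[Tuple[float, int]] = deque()

    def _wait_needed(self, now: float, tokens: int) -> float:
        # Forget requests that have left the 60-second token window.
        while self.events and now - self.events[0][0] >= 60.0:
            self.events.popleft()
        wait = 0.0
        recent = [t for t, _ in self.events if now - t < 1.0]
        if len(recent) >= self.max_rps:
            # Wait until the oldest request in the last second ages out.
            wait = max(wait, recent[0] + 1.0 - now)
        used = sum(n for _, n in self.events)
        if used + tokens > self.max_tpm:
            # Wait until enough old requests expire to free the tokens needed.
            excess = used + tokens - self.max_tpm
            for t, n in self.events:
                excess -= n
                if excess <= 0:
                    wait = max(wait, t + 60.0 - now)
                    break
        return wait

    def acquire(self, tokens: int) -> None:
        """Block until a request of `tokens` tokens fits under both caps."""
        if tokens > self.max_tpm:
            raise ValueError("a single request exceeds the per-minute token cap")
        while True:
            now = self.clock()
            wait = self._wait_needed(now, tokens)
            if wait <= 0:
                self.events.append((now, tokens))
                return
            self.sleep(wait)
```

In use, one would call `throttle.acquire(estimated_tokens)` immediately before each request. The clock and sleep functions are injectable only so that the logic can be checked without actually waiting.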

Model                 Input ($ / 1k tokens)    Output ($ / 1k tokens)
gpt_4o                0.000005                 0.000015
gpt_4o_2024_08_06     0.0000025                0.00001
gpt_4o_mini           0.00000015               0.0000006
o1_mini               0.000003                 0.000012
o1_preview            0.000015                 0.00006

In general, the more sophisticated the model, the more it costs, and output tokens cost more than input tokens.
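To get a rough sense of how far the daily allowance stretches, the rates in the table above can be plugged into a small calculation. The sketch below is illustrative only: the function names are ours, just two of the listed models are included, and the pricing unit is taken from the table header (1,000 tokens) but exposed as a parameter in case the unit needs adjusting.

```python
# (input rate, output rate) in dollars, copied from the table for two models.
RATES = {
    "gpt_4o": (0.000005, 0.000015),
    "gpt_4o_mini": (0.00000015, 0.0000006),
}

DAILY_BUDGET_USD = 5.0  # daily usage limit stated above


def request_cost(model: str, input_tokens: int, output_tokens: int,
                 tokens_per_rate_unit: int = 1000) -> float:
    """Estimated dollar cost of one request.

    `tokens_per_rate_unit` is how many tokens each listed rate covers;
    1000 follows the table header and is an assumption to verify.
    """
    in_rate, out_rate = RATES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / tokens_per_rate_unit


def requests_per_day(model: str, input_tokens: int, output_tokens: int,
                     tokens_per_rate_unit: int = 1000) -> int:
    """Approximate number of identical requests that fit in the daily budget."""
    cost = request_cost(model, input_tokens, output_tokens, tokens_per_rate_unit)
    return int(DAILY_BUDGET_USD // cost)


if __name__ == "__main__":
    # Example: a 1,500-token article with a 10-token reply (a sentiment label).
    for model in RATES:
        print(model, request_cost(model, 1500, 10), requests_per_day(model, 1500, 10))
```

Because a sentiment label is only a few tokens long, the input side dominates the cost of this kind of task, so trimming long articles before sending them has more effect than shortening the reply.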

Please do not hesitate to contact Research Data Services with questions about TDM Studio or any of your research data needs.