GPT for text manipulation tasks

This summer (2023) I tried out some generative large language models for a couple of text manipulation tasks that have been plaguing me for some time: converting text citations into machine-readable BibTeX for importing into citation managers like Zotero, and extracting bits of data from semi-structured text (my thesis project).

tl;dr: It can work, sort of. But I don't think I'll be using any of the models regularly for this sort of task.

For my experiments, I used ChatGPT 3.5 (mostly the May 2023 version), the one freely available on the web, and I tried out several downloadable models in GPT4All (gpt4all.io; available for Windows, Mac, and Linux). Part of the experiment was using easily available tools, so, yes, I know that I could access GPT-4 with a paid account and/or pay to use the API with something like Python. I would probably get better results that way, but those aren't options I can simply recommend to the average student or faculty member without more background or training. Another caveat: the local models require a LOT of processing power to work well, which probably affected my results, as I was running them on a several-years-old MacBook Pro with only average specs.

The citation task is something that comes up a lot in my librarian work, especially when introducing someone to a citation management program like Zotero. Everyone has text-based citations lying around, in papers written or read, but you can't just copy and paste them into the programs. If you have IDs, like ISBNs or DOIs, you can paste those in, but for citations that lack those handy shortcuts, you are generally limited to searching for an importable record in a library database or search engine, or manually entering the info. I wanted to see if generative AI could assist with this. Besides formatted text citations (mostly APA, but some in other formats) I also tried pasting in info from a Scopus CSV download – because that was a specific question I'd gotten earlier.
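For the CSV case, it's worth noting that the conversion can also be scripted without an LLM at all. Here is a minimal Python sketch; the column names are modeled on a Scopus export but should be checked against your own download's header row, and the sample row is invented for illustration:

```python
import csv
import io

# Invented sample data; the column names mimic a Scopus CSV export,
# but verify them against the header row of your actual download.
SCOPUS_CSV = """Authors,Title,Year,Source title,DOI
"Smith J.; Lee K.",A sample paper,2021,Journal of Samples,10.1234/abc
"""

def row_to_bibtex(row: dict) -> str:
    """Build a BibTeX @article entry from one CSV row."""
    # Citation key: first author surname + year, e.g. Smith2021
    key = row["Authors"].split()[0].strip(",") + row["Year"]
    return (f"@article{{{key},\n"
            f"  author = {{{row['Authors'].replace('; ', ' and ')}}},\n"
            f"  title = {{{row['Title']}}},\n"
            f"  journal = {{{row['Source title']}}},\n"
            f"  year = {{{row['Year']}}},\n"
            f"  doi = {{{row['DOI']}}}\n"
            f"}}")

for row in csv.DictReader(io.StringIO(SCOPUS_CSV)):
    print(row_to_bibtex(row))
```

A script like this only works when the columns are predictable, which is exactly what the semi-structured citation text lacks – hence the appeal of an LLM for the general case.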

ChatGPT did pretty well. It recognized the BibTeX format and could reproduce it. It often needed me to identify (or correct) the item type (book, conference paper, article, etc.), but it takes suggestions and spits out corrected output. The difficulty came with the “generative” part of the model – it makes things up. It “knows” that article citations are supposed to have DOIs, so it added them. They look fine, but aren't always real. It also made up first names from authors' first initials. It mostly obeyed when I instructed it NOT to add any info beyond what was in the citations, but that is something that would have to be double-checked for every citation. That's fine if you are doing a handful, but tedious if you are doing hundreds. It did work well enough to add to my Citations from Text Protocol document (https://dx.doi.org/10.17504/protocols.io.bp2l6bkq5gqe/v2), with a comment about the incorrect additions.
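Because a fabricated DOI can look syntactically perfect, one cheap safeguard is to pull every DOI out of the generated BibTeX and spot-check each one at doi.org. A minimal Python sketch (the sample entry is invented):

```python
import re

def extract_dois(bibtex: str) -> list[str]:
    """Collect every doi field from a BibTeX string for manual spot-checking."""
    return re.findall(r'doi\s*=\s*[{"]([^}"]+)[}"]', bibtex, flags=re.IGNORECASE)

# Invented example of the kind of entry ChatGPT produces.
sample = """@article{smith2020,
  author = {Smith, J.},
  title = {An Example},
  journal = {Journal of Examples},
  year = {2020},
  doi = {10.1234/example.5678}
}"""

for doi in extract_dois(sample):
    print(f"check https://doi.org/{doi}")
```

Pasting each printed URL into a browser (or resolving it with an HTTP request) confirms whether the DOI actually exists and points at the cited work – the part no regex can verify.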

The various models in GPT4All didn't do so well. Most of the available models couldn't recognize or produce BibTeX format. Several only seemed able to reproduce something similar to APA style, while others just repeated back whatever I had put in the prompt. This is most likely a factor of the training data – I'm quite sure that if the training data had included more citation formats, at least one of these models would be capable of the task.

The second task is one left over from my thesis work (https://doi.org/10.5281/zenodo.5807765): compiling ecological data from a series of DEEP newsletters. Within the paragraphs of the newsletters are sightings data for gamefish in Long Island Sound, often with locations and sometimes measurements. At the time, I queried a number of people doing similar work with text-mining programs but couldn't find anything that could do this sort of extraction. This seemed like something an LLM should be able to do.

Again, ChatGPT did better than the locally run models, probably (again) due to the processing power available. The GPT4All models mostly extracted the info correctly, but they couldn't format it the way I needed. ChatGPT was able to do that, and was even able to handle text from older documents with faulty OCR. But it was inconsistent: I sometimes got different results from the same text, and it never found everything that I had pulled out manually. ChatGPT did not insert additional info in any of my tests. This is a task that I would be curious to try with a more advanced model.
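For a sense of why an LLM is appealing here: a hand-written pattern only covers one phrasing at a time. This hypothetical Python sketch (the newsletter sentence is invented, and real DEEP prose varies far more) shows the brittle regex approach the LLM was meant to replace:

```python
import re

# Invented sentence in the style of a fisheries newsletter report;
# actual newsletter prose uses many different phrasings.
text = "Striped bass were reported at Niantic Bay, with one fish measuring 34 inches."

# A narrow pattern matching exactly one phrasing. Covering real prose
# would take dozens of patterns like this, which is the problem an LLM
# prompt is supposed to solve in one shot.
pattern = re.compile(
    r"(?P<species>[A-Z][a-z]+ [a-z]+) were reported at (?P<location>[A-Z][\w ]+?),"
)
m = pattern.search(text)
if m:
    print(m.group("species"), "|", m.group("location"))
```

The trade-off my tests surfaced: the regex is rigid but deterministic, while the LLM handles varied phrasing but returns different results on different runs.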

While my test results were lackluster, I see promise in this work. But not enough promise, right now, to be confident using the programs or to outweigh the ethical dilemmas inherent in the current models. OpenAI (the producer of ChatGPT) employs thousands of low-wage workers, mostly in the Global South, to screen inputs for its model, and uses a huge amount of energy and water to run and cool its servers. This is something that anyone seriously contemplating using ChatGPT, or other LLMs and generative AI models, for research tasks should consider.