rhedreen

Random writings

What is “AI”?

  • Semantic Analysis vs. LLMs
  • Image analysis vs. Art/graphics generators
  • Self-contained vs. open to internet

What can they do?

  • LLMs generate “plausible” text: likely, not necessarily accurate. Results should read well and have few obvious flaws – but the flaws that remain may be *much* less obvious!
  • LLMs have a “temperature” setting that controls how predictable the output is: low values stick close to the most likely wording from the training data, higher values allow more variation (see the sketch after this list.)
  • Prompt wording is important – use the prompt to set the circumstances, the type of results, the audience, and any details. (Note: some models will ignore things that they can’t handle; try rewording, or a different model/platform.)
  • All generator systems are dependent on their training data; bias in, bias out. Common errors are repeated; common stereotypes are emphasized.
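That “temperature” knob is an actual parameter if you use the models through an API. A minimal sketch, assuming OpenAI's pre-1.0 `openai` Python package; the key, model, and prompt are placeholders:

```python
# Minimal sketch of the temperature parameter, assuming the pre-1.0
# "openai" package. Key, model, and prompt are placeholders.
import openai

openai.api_key = "YOUR-API-KEY"  # placeholder

for temp in (0.0, 0.7, 1.5):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Describe a river in one sentence."}],
        temperature=temp,  # 0 = most predictable; higher = more varied
    )
    print(temp, response["choices"][0]["message"]["content"])
```

Run it a few times: at 0 the sentence barely changes between runs; at 1.5 it wanders.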

These are **Tools**

Ethics

  • training data/copyright
  • training processes
  • privacy
  • “replacing” humans

Learning more

  • plenty of free webinars and short courses/workshops available
  • beware both hype and doom
  • try things out!

Tools to try

Went for a New Year's walk in the local park to look for New Year's inspiration (stuff to think about; my preference rather than resolutions.) The things that caught my eye were a very large tree that came down due to wet soil where the local river has been overfull, and this shimmery reflection of trees in the river. A reminder that change is ever-present, but it can be catastrophic or beautiful, or even both at the same time.

This summer (2023) I tried out some generative large language models for a couple of text manipulation tasks that have been plaguing me for some time: converting text citations into machine readable bibtex for importing into citation managers like Zotero, and extracting bits of text data from semi-structured text (my thesis project.)

tl;dr: It can work, sort of. But I don't think I'll be using any of the models regularly for this sort of task.

For my experiments, I used ChatGPT 3.5 (mostly the May 2023 version), the one freely available on the web, and I tried out several downloadable models in GPT4All (gpt4all.io; available for Windows, Mac, & Linux.) Part of the experiment was using easily available tools, so, yes, I know that I can access ChatGPT 4 with a paid account and/or pay to use the API with something like Python. I would probably get better results with those, but they aren't something that I can simply recommend to the average student or faculty member without more background or training. Another caveat: the local models require a LOT of processing power to work well, so my results were probably affected by my using a several-years-old MacBook Pro with only average processing power.

The citation task is something that comes up a lot in my librarian work, especially when introducing someone to a citation management program like Zotero. Everyone has text-based citations lying around, in papers written or read, but you can't just copy and paste them into the programs. If you have IDs, like ISBNs or DOIs, you can paste those in, but for citations without those handy shortcuts, you are generally limited to searching for an importable record in a library database or search engine, or manually entering the info. I wanted to see if generative AI could assist with this. Besides formatted text citations (mostly APA, but some other formats) I also tried pasting in info from a Scopus CSV download – because that was a specific question that I'd gotten earlier.

ChatGPT did pretty well. It recognized the bibtex format and could reproduce it. It often needed me to identify (or correct) the item type (book, conference paper, article, etc.), but it takes corrections and spits out revised output. The difficulty came with the “generative” part of the model – it makes things up. It “knows” that article citations are supposed to have DOIs, so it added them. They look fine, but aren't always real. It also made up first names from authors' first initials. It mostly obeyed when I instructed it NOT to add any info beyond what was in the citations, but every citation would still have to be double-checked. That's fine if you are doing a handful, but tedious if you are doing hundreds (though the DOI checking, at least, could be scripted – see the sketch below.) It did work well enough to add to my Citations from Text Protocol document (https://dx.doi.org/10.17504/protocols.io.bp2l6bkq5gqe/v2) with a comment about the incorrect additions.
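A rough sketch of that DOI check, assuming the `requests` library; the second DOI is an invented example of a plausible-looking fake. A DOI that resolves isn't proof it points at the *right* paper, but one that doesn't resolve is a definite red flag:

```python
# Rough sketch: flag DOIs that don't resolve at doi.org.
# Assumes the "requests" library; the DOI list is illustrative.
import requests

dois = [
    "10.17504/protocols.io.bp2l6bkq5gqe/v2",  # real (from this post)
    "10.1234/made.up.by.chatgpt",             # invented fake
]

for doi in dois:
    resp = requests.head(f"https://doi.org/{doi}", allow_redirects=False)
    # doi.org redirects (302) for registered DOIs and returns 404
    # for unregistered ones.
    ok = resp.status_code in (301, 302, 303)
    print(doi, "resolves" if ok else "DOES NOT RESOLVE - check it")
```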

The various models in GPT4All didn't do so well. Most of the available models couldn't recognize or produce bibtex format. Several only seemed able to reproduce something similar to APA style, while others just repeated back whatever I had put in the prompt. This is most likely down to the training data – I'm quite sure that if the training data had included more citation formats, at least one of these models would be capable of this task.

The second task is one left over from my thesis work (https://doi.org/10.5281/zenodo.5807765): compiling ecological data from a series of DEEP newsletters. Within the paragraphs of the newsletters are sightings data for gamefish in Long Island Sound, often with locations and sometimes measurements. At the time, I queried a number of people doing similar work with text-mining programs but couldn't find anything that could do this sort of extraction. This seemed like something an LLM should be able to do.

Again, ChatGPT did better than the locally run models, probably (again) due to the processing power available. The GPT4All models mostly extracted the info correctly, but they couldn't format it the way I needed. ChatGPT could, and it was even able to handle text from older documents with faulty OCR. But it was inconsistent: I sometimes got different results from the same text, and it never found everything that I had pulled out manually. ChatGPT did not insert additional info in any of my tests. This is a task that I would be curious to try with a more advanced model.
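For the curious, the shape of the task looks something like this sketch – the prompt wording and newsletter text are invented for illustration, not my actual thesis data, and it again assumes the pre-1.0 `openai` package:

```python
# Sketch of the extraction task. Prompt and newsletter text are
# invented for illustration; assumes the pre-1.0 "openai" package.
import openai

openai.api_key = "YOUR-API-KEY"  # placeholder

newsletter_text = (
    "Anglers reported good numbers of bluefish off the reef this week, "
    "with several fish in the 8-10 lb range; striped bass were also "
    "taken near the river mouth."
)

prompt = (
    "Extract every fish sighting from the text below as CSV with the "
    "columns species,location,measurement. Use NA for missing values, "
    "and do not add any information that is not in the text.\n\n"
    + newsletter_text
)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # keep the output as repeatable as possible
)
print(response["choices"][0]["message"]["content"])
```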

While my test results were lackluster, I see promise in this work. But not enough promise, right now, to be confident using the programs or to counter the ethical dilemmas inherent in the current models. OpenAI (the producer of ChatGPT) employs thousands of low-wage workers, mostly in the Global South, to screen inputs for its model, and uses a huge amount of energy and water to run and cool its servers. This is something that anyone seriously contemplating using ChatGPT, or other LLMs and generative AI models, for research tasks should consider.

Possible title/topic: Advanced note-taking for writing: personal knowledge management – for (advanced undergrads, grads, faculty); writing for theses, dissertations, publication

Note-taking serves two purposes for writing: 1. Writing for retrieval – writing down things you want to be able to find again. 2. Writing for thinking – writing as a way of digesting, conceptualizing, and putting info in context (especially your own context.)

Writing for retrieval

  • Proper identification of sources – citation info (formatted or just complete); source (links, files, etc.); include info on finding it (search engines/databases, search terms, search or access problems, etc.)
  • Description – tagging or other searchable text that includes context; why is this important to you; quotations (with complete citation and page numbers!) or annotations; think how you might be looking for this later on.
  • Storage – digital storage means that you can make these notes searchable; there are many tools specifically designed for note-taking (OneNote, Evernote, Joplin, etc.) and ones designed for knowledge management (Obsidian, Logseq, Roam, Tana, etc.), but you can use any searchable file – including Word or Google docs. However, I recently had a conversation with someone who wanted hardcopy only, and we thought about writing notes (with source info) on post-its, kept originally on the printed docs, then moving the post-its as needed to a notebook or whiteboard for organizing and as writing prompts. Look up “Zettelkasten” for high-end hardcopy note-taking!

Writing for thinking

  • Context & Connection – how would you use this; what does it remind you of; what other things do you have notes on that might be related (search and link!)
  • Summarization – rephrase and summarize (you can't do this if you don't understand it.) Combine with other info; be explicit about connections. Don't just paraphrase by swapping out words – write your own understanding.

Many people recommend individual notes for specific references (or specific ideas from specific references) and then separate notes for concepts that link/reference to the individual source notes. Update the concept notes as you collect new sources, and split off new concept notes when you have a new concept or aspect to explore. But link back to the old concept notes and to any relevant source notes.

Using notes in writing

  • Avoid the blank page – copy useful notes (usually source notes for lit review or analysis; concept notes for more topical paragraphs) into a new document and/or into an outline to start.
  • Save extraneous ideas for later – if something needs to be cut, you don't have to throw it away (aka “Killing your darlings”), just make a new note.

This sort of note-taking should be iterative. Write some notes; this gets the ideas into your brain. Your brain can process and digest the ideas and produce new insights. Write down those insights, connecting to new and old info. Repeat.

Someone just described ChatGPT as a tool that shows you what an answer looks like (rather than giving you an accurate answer.) Which sounds useless.

But this is something that as a librarian I have a problem describing to students. In traditional (non-LLM) searching, you need to search using the language of the answer. So, for example, I tell my biology students to search using the Latin species name of an organism, because that is more likely to result in scientific articles.

So a possible use would be to show you what language the answer is likely to use, how the language is used, and what related concepts you should think about. The trick is how that could be presented in a way that doesn't lead to the shortcut of “That sounds like a reasonable answer. I'll stop here.”

One way is Elicit's “Suggest Search Terms” task – put in your question and Elicit pulls out common keywords and phrases. (Elicit requires an account, and it's a little hard to imagine that it's going to remain free forever.)

I heartily dislike both the AI-hype and the AI-doom-and-gloom that makes up most of the popular reading on ChatGPT and other large language model tools. These are neither the best thing since the personal computer, nor the thing that will bring higher ed crashing down around our ears. They are tools, and like all tools have appropriate and productive uses, inappropriate and concerning uses, and things for which they are not suited (but that people are trying to use them for anyway.) You can use a chisel to do wood carving, or you can use a chisel to break open a locked door. The latter might be a good thing or a bad thing depending on circumstances. You could also try to use a chisel as a screwdriver, and it might work for some screws, but it's not the best tool and you are likely to hurt yourself if you aren't very careful. (And it's not that great for the chisel, either.)

In my own experimentation, I've come up with some use cases that I think fit into the 'appropriate and productive' category. The one thing that I've found so far is that these tools are most useful for manipulating text. The real 'generative' uses seem to me to be very superficial. 'Produce a [blank] in the style of [blank]' is fun the first few times, but not very interesting overall. And mostly kind of bland (which, as someone pointed out, should only be expected: GPT produces essentially the most likely or “average” text for a given prompt (1).) More like 'produce a [blank] in the style of [blank] on a bad day.'

Here's what I've found useful. I'll add to this as I find new uses.

  1. Translation. ChatGPT levels up translations from the standard translation engines available on the web. Like all machine translation, results are a bit stilted, colloquialisms can be confusing, and less common languages give worse results, but overall, I'm pleased. I suspect that all the translation engines will be incorporating LLMs (if they haven't already) and we should see improvements in the applications soon.

  2. Text mining. I was very excited to find that ChatGPT could extract semantically useful info from narrative text. This is what I did my thesis on, and a very tedious 20K entry dataset it ended up being. I'm eager to start comparing the GPT-generated results to my previous work and to add to my dataset with new entries as soon as I'm satisfied with the quality.

  3. Search assistance. I probably shouldn't have been surprised that ChatGPT could extract semantically useful info from text, since that's exactly what the 'research assistance' apps like Elicit and Consensus do. Those specialize in analyzing research papers and pulling enough info out for you to figure out whether the paper might be useful to you. Both are in heavy development right now, but can do things like extract text related to an actual question or pull methodological info out of a selection of studies (population size, analysis techniques, etc.) (2)

  4. Transforming text into machine readable formats. Since ChatGPT can do translation and can also “produce in the style of,” it stands to reason that it should be able to manipulate text into a particular format. And it can, at least with formatted citations into bibtex, one of the importable file formats that citation managers like Zotero use. It would be tedious to do a lot of them at once because of the character limits, but I'm hoping someone will write a script using the API (a rough sketch of what that might look like follows this list.) I had hopes that it might be able to do the same with citations in a spreadsheet (CSV) but since the prompt box is unformatted text with limited characters, I couldn't get it to recognize more than a line or two at a time. It did a reasonable job on those few lines, however. Again, very tedious to do a lot, but it would work and might be suitable for some API scripting.

  5. Audio transcription. I've actually paid for a transcription program called MacWhisper that uses a variety of Whisper models to transcribe audio. It's a cut above the machine transcription available in most of the presentation tools (Zoom, PowerPoint, Google Slides) and it works locally, so it's a bit more private than the better services like Otter. It still has trouble with names, but it has a new feature that lets you put in commonly mis-transcribed words, so I can probably get it to stop misspelling my name and the CINAHL database (sin all, cinder, etc.) MacWhisper is Mac only, but I just saw a project called Audapolis that's multi-platform.

  6. Suggesting new wording. If you've got an awkward paragraph, or want to improve the flow and readability of something, ChatGPT does a decent job. One of the first things I tried was inputting a couple of paragraphs written by someone for whom English was definitely not a native language. The original was comprehensible, but stilted, with some weird word choices. I'd assume, given the translation ability, that it can do this for other languages as well. The results weren't spectacular, basically just bland American English academic prose, but definitely more readable. If I were doing this for anything real, I'd want to review the resulting text very carefully to be sure nothing had been factually changed.

  7. Avoiding 'blank page syndrome.' Since ChatGPT doesn't do well at producing accurate facts and references that exist, this is better done with another tool. I found that Perplexity gave a decent summary of what might be called 'accepted internet knowledge' on a topic, and gives citations – and you can specify what sorts of things you want as references: scholarly journals, newspapers, etc. As I mentioned previously, Elicit and Consensus will both give text extractions and summaries direct from research papers. Any of this could be used to construct an outline or scaffolding to get rid of that blank page. ChatGPT can produce an outline, too; just be sure to thoroughly check any factual claims. Really, do this anyway – just because something is common on the internet doesn't mean it's right (Perplexity), and extracted sentences taken out of context may be misleading (Elicit and Consensus.) In a way this is the reverse of #6: start with the LLM and have the human produce the final text.
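Here's a rough sketch of the API script I mentioned hoping for in #4, batch-converting text citations to bibtex. It assumes the pre-1.0 `openai` package; the file names are placeholders, and every entry would still need human review for invented DOIs, expanded initials, and the like:

```python
# Rough sketch: batch-convert formatted text citations to BibTeX via
# the API. Assumes the pre-1.0 "openai" package; file names are
# placeholders. Every entry still needs human review for invented
# DOIs, expanded initials, etc.
import openai

openai.api_key = "YOUR-API-KEY"  # placeholder

with open("citations.txt") as f:  # one formatted citation per line
    citations = [line.strip() for line in f if line.strip()]

with open("citations.bib", "w") as out:
    for citation in citations:
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": (
                    "Convert this citation to a BibTeX entry. Do not "
                    "add any information that is not in the citation:\n"
                    + citation
                ),
            }],
            temperature=0,
        )
        out.write(response["choices"][0]["message"]["content"] + "\n\n")
```

The resulting .bib file imports straight into Zotero (File > Import).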

All but the last of these are on the order of “take this text that I give you and do something with it” rather than “produce new text from scratch.” Only the last two stray into what could be considered an ethical grey area – how much of the writing is LLM-produced and how much person-produced, no matter which comes first?

Anyway, these are what I've found actually useful so far.

(1) https://www.washingtonpost.com/technology/2023/04/01/chatgpt-cheating-detection-turnitin/

(2) Update (2023-04-23): Aaron Tay has also been experimenting with using LLMs for search, specifically Perplexity, and he keyed in on the ability to restrict the sources used as well. https://medium.com/a-academic-librarians-thoughts-on-open-access/using-large-language-models-like-gpt-to-do-q-a-over-papers-ii-using-perplexity-ai-15684629f02b

While ChatGPT has gotten all the press, there are some “AI” (I don't actually like this term[1]) Large Language Model and Semantic Analysis tools out there that I think can help with doing searches and finding literature.

In a theoretical search scenario, I think I'd start with Perplexity.ai (https://perplexity.ai; no registration required), an “answer engine.” It also gives you a short answer to questions, but, unlike ChatGPT, it's doing actual internet searches (or at least searching a reasonably updated internet index) and cites its sources. You can even ask it for peer-reviewed sources. This is a lot like using a good Wikipedia entry – get an overview, some interesting details, and some references to follow up on. It is, like most internet-based things, going to be biased towards whatever the majority of the sources say, so I could see it spewing some pseudoscience or conspiracy stuff out, but it does look like the programmers gave it some filters on what it uses for sources. As they say, “Perplexity AI is an answer engine that delivers accurate answers to complex questions using large language models. Ask is powered by large language models and search engines. Accuracy is limited by search results and AI capabilities. May generate offensive or dangerous content. Perplexity is not liable for content generated. Do not enter personal information.”

Then, I'd take those sources and plug them into a semantic/citation network search like SemanticScholar.org (https://semanticscholar.org; no registration required), ConnectedPapers.com (https://connectedpapers.com; no registration required), and/or ResearchRabbit.ai (https://researchrabbit.ai; registration required.) These look at the citation networks, author networks, and/or semantic relationship networks of scholarly works and display them in different ways to show you (what might be) related works. Most of these are based on SemanticScholar's database (as the most open and freely available scholarly source out there), so they mostly come up with similar results, but each has additional features that expand on the base. SemanticScholar's “Highly Influential” citations attempt to determine the works most closely based on the original work. ConnectedPapers looks at second-order citations (cited by citing articles or references, etc.) to identify what might be foundational works or review articles, and has nice network maps to explore. ResearchRabbit can look at groups of papers to find common citations and authors, and you can view results in network maps and timelines. If you register, all of these offer alerts, too, based on your searches.
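SemanticScholar's database is also open programmatically, which is part of why so many of these tools build on it. A sketch using its public Graph API – the endpoint and field names are my best reading of the documentation, so double-check against the current docs, and the DOI is a placeholder:

```python
# Sketch: pull a paper's title and reference list from Semantic
# Scholar's public Graph API. Endpoint and field names per my reading
# of the docs (verify against current docs); the DOI is a placeholder.
import requests

doi = "10.xxxx/replace-with-a-real-doi"
url = f"https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}"
resp = requests.get(url, params={"fields": "title,references.title"})
data = resp.json()

print(data.get("title"))
for ref in data.get("references") or []:
    print(" -", ref.get("title"))
```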

Once I had a core set of works, I'd go back to the tried and true library databases, especially ones with subject headings/controlled vocabulary. A controlled vocabulary establishes a single word or phrase for each concept within a particular discipline (MeSH for medicine, the ERIC Thesaurus for education, etc.) Every work entered into the database is tagged with these “controlled” terms, so you can be confident that all the articles in Medline/PubMed about heart attacks are tagged with “myocardial infarction.” (There are some experiments with using semantic algorithms to tag database entries, but to the best of my knowledge all or most of the traditional sources still use humans as quality control.) By looking up some of the articles I found through the other sources, I could find the relevant subject headings and use those to search out more results.
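The controlled-vocabulary trick is even scriptable: PubMed's public E-utilities accept the MeSH field tag directly, so the “heart attack” articles indexed under “myocardial infarction” come back regardless of the words in the title. A sketch, assuming the `requests` library:

```python
# Sketch: search PubMed by MeSH term via NCBI's public E-utilities.
# The esearch endpoint and [MeSH Terms] field tag are standard;
# retmax just limits how many IDs come back.
import requests

resp = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={
        "db": "pubmed",
        "term": '"myocardial infarction"[MeSH Terms]',
        "retmode": "json",
        "retmax": 5,
    },
)
result = resp.json()["esearchresult"]
print("total hits:", result["count"])
print("first PMIDs:", result["idlist"])
```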

Elicit.org (https://elicit.org; registration required) is another GPT-based tool that bills itself as a “research assistant.” It's a little more complicated, but has some very interesting features. It pulls out quotes from the results that it determines are relevant to your question or topic. You can ask it for particular types of research (qualitative, etc.) or have it highlight aspects of the research, like study size. There are additional “tasks” besides the general search feature – one of which is to find search terms related to a topic! It's still very much in the experimental stage, but also very intriguing.

So...with all of these new tools, am I worried about being replaced by a librarianbot? No, I'm not.

  • Using these tools requires skill, which means either training or time and willingness to experiment. In my experience, most people want training and are happy to outsource the experimentation to people like me.
  • The scholarly publishing world is complicated and while open access has made a lot of stuff more easily available, it's also made things even more confusing. I get a LOT of questions about accuracy, currency, and other quality issues and I do not see those going away anytime soon. (And in the short term, I think those will be even more of an issue as tools like ChatGPT generate plausible but inaccurate text that gets put out there unidentified.)
  • Access is still a big issue, and librarians are the people that most institutional scholars turn to for access issues.
  • A lot of people like working with people or at least having a person available to them. It's reassuring to know a real person has your back. “Hand holding” (you know what to do but you want me to reassure you that you are doing it right) is a big part of learning, especially in this age of anxiety.
  • Most of these tools remove tedium. Younger scholars have no idea what I mean when I say that the biggest benefit of online databases is being able to search more than one year at a time. Only people who remember printed indexes (or at least CD indexes) can appreciate NOT having to search each yearly (or semi-annual) volume one after another after another... I'm quite happy not doing that any more. It means I get to do more interesting things, like working on systematic reviews with other researchers, or investigating new search tools, or teaching citation and note-taking systems.

Which means I'm excited for these new tools, as long as they are producing useful results. I'm less excited about tools that produce misinformation, like ChatGPT's made-up citations[2] or the AI-powered voice synthesizer that everyone except the promoters predicted would be used for faking celebrity statements.[3]

So go out and enjoy the AI (again, see footnote 1). No one is stuffing this genie back into the bottle, so we need to learn how to live with it. And it can make some things better.

[1] I don't like the term AI/Artificial Intelligence because we all grew up with science fiction in which AI was actually AI – artificial beings with what we could easily recognize as human-like consciousness and intelligence. (Putting aside for the moment the problem that we often don't recognize, or want to recognize, the intelligence of other human beings – often the point of those science fiction stories.)

[2]

[3] https://gizmodo.com/ai-joe-rogan-4chan-deepfake-elevenlabs-1850050482


If, as implied by our Provost's email, a possible response to COVID-19 could include moving courses online, some provision for library services to those courses will be required.

Much depends on the exact nature of the quarantine or other restrictions. In other similar situations I have heard of vulnerable employees being placed on a leave of absence, closing the library to the public but continuing services otherwise, allowing staff to telecommute, etc.

From conversations with librarians at other institutions who have dealt with emergencies that prevented library staff from coming in (mostly during natural disasters), I can summarize the strategies:

  • It should be determined if the institution can provide additional disinfection supplies (hand sanitizer, disinfecting wipes, etc.)
  • Supervisors should determine what work can be done remotely and which employees can work remotely (i.e. have the required equipment and/or internet services to do so).
  • If employees need additional equipment (laptops, barcode scanners, etc.) that equipment should be identified and prepared.
  • Provisions and practice may be needed for subject librarians to provide reference and instruction online. It should be determined who can provide video conferencing vs. web chat, for instance, based on equipment, connections, skills, and experience. Practice sessions can be arranged.
  • Assuming that some staff are allowed in with the physical collections, protocols for requesting scans of library materials need to be established.
  • If staff are not allowed into the building, then extra budgetary resources may be needed for ILL and document delivery.

A useful article summarizing the library response to the 1918 pandemic, with planning advice, is “In Flew Enza” from American Libraries (Quinlan, 2007).

Advance planning provides the best defense against emergencies and reduces stress for both employees and patrons.

Quinlan, N. J. (2007). In flew Enza. American Libraries, 38(11), 50–53.

(Updated 2019-06-27; updated 2023-01-16 to note Microsoft Academic's demise; also to note that I am no longer using Kopernio, which is now part of Clarivate's Endnote. I need to make a new list!)

A list of online tools, services, and software I'm finding useful.

  • Kopernio (Chrome, Firefox) https://kopernio.com/ Now owned by Clarivate Analytics (owners of Web of Science and Journal Citation Reports), Kopernio is a browser plugin (Firefox & Chrome) that can find PDFs of articles via both open access sources and library subscriptions. Also has online storage “locker” for PDFs. Obviously, some data harvesting is going on, but it really simplifies finding full text articles. The Google Scholar plugin (from GS) has similar functions (data to Google, obviously) and Lazy Scholar is an independent plugin that also allows you to access your library's subscriptions. Unpaywall and the Open Access Button are other plugins that only find open access sources. All good, but right now I'm trading data (with a reasonable data privacy policy) for convenience with Kopernio.
  • Zotero (Mac, Windows, Linux) https://zotero.org/ My preferred citation manager, and the only major one currently NOT owned by a big publisher. (Elsevier owns Mendeley, Clarivate owns Endnote, and Proquest owns Refworks.) But besides that, I prefer it for the ease of import and export. Generally, I find the import options work better and it exports in more formats (RIS, bibtex, csv, etc.) than other software. It comes with over 9000 citation styles, and reads the open source format CSL so you can modify or create any style you need if those 9000 aren't doing it for you. If you just need a quick formatted citation, try their online service, ZoteroBib, https://zbib.org/
  • Anystyle.io (online) https://anystyle.io/ Convert text-formatted citations (APA, MLA, etc.) into bibtex or other machine-readable citation formats. It's essentially a proof-of-concept machine-learning GitHub project, but I was able to take around 100 references from a dissertation and get a file to import into Zotero in about half an hour. Not bad for something “intended for non-commercial, limited use.” Certainly a LOT faster than if I'd been typing (or copying/pasting) them in, or even doing a search-and-import. It's not instantaneous, but you have a lot of control over the process, so you can fix errors before importing.
  • Texts (Academic Markdown Editor) (Mac, Windows) http://www.texts.io/ I'm trialing this markdown editor right now, and I am liking it. A minimalist writing tool, it uses markdown language (think HTML for more general text) to structure documents (and structure leads to formatting.) You can do all the major document structures: lists, numbered lists, headings, footnotes, quotations, and – unusually for these markdown editors – citations. (Citations are a little tricky, but once you've got the system down, it's not too hard.) Once I have my basic text, I can export in a number of formats, including PDF, RTF, Word, HTML, ePub, LaTeX, and a very nice minimalist HTML presentation style that I would be very happy to present with at a professional conference. What's really neat is that the SAME FILE can be exported in multiple formats, so a properly structured text document can automatically make a presentation, etc. Those are the benefits of markdown; Texts just makes it all really easy. The only drawback I've seen so far is an inability to resize images. I think it's a 30-day trial, and I've seen a price of $19 in reviews (though I'm not seeing a cost on the website right now.)
  • Microsoft Academic Search (now defunct, 2022) Microsoft's answer to Google Scholar, with extras. Besides the general indexing, Microsoft has added a semantic analysis component, so that things like institution are parsed out of articles and become searchable. Each document entry includes the usual citation, abstract, and sources (OA direct downloads), but also how the document fits into the citation network (references and citing articles) and all the parsed topic, institution, journal, date, and author data. Plus, a “related” search that uses a semantic analysis to find similar documents. My librarian's heart is cheered by the fact that it also lists my research guides as scholarly documents (even if it did take a bit of work to “claim” all of them when I set up my profile.) Two drawbacks: 1) apparently you can't use your institutional Microsoft account (Office 365, etc.) to login – at least it's never accepted mine; and 2) there is no ability to link to an institution subscription login. (However, see Kopernio, above.) It's currently a smaller database than Google Scholar, but it's growing, and it has some very nice features of great use to researchers and students.
  • JSTOR's Text Analyzer (online) https://www.jstor.org/analyze Speaking of semantic search, JSTOR's Text Analyzer (beta) does that, too, and shows you what it's doing. You don't need a JSTOR subscription to use it, just go to the page and upload another article or paste in some text and see what comes up. Then you can play around with the search features to refine the search. (If you don't have an institutional subscription, btw, JSTOR has a number of options for independent scholars that are very affordable.) JSTOR has a nice write up on teaching with it, too, at https://daily.jstor.org/how-to-teach-with-jstor-text-analyzer/. Other suggestions I've run across include having students use the subject terms to help interpret and summarize an article. (JSTOR's own video implies that you can use it for that “I wrote the paper, now I need sources” style of academic writing. Librarians would prefer to discourage that, however.)
  • Publish or Perish (Mac, Windows, Linux) https://harzing.com/resources/publish-or-perish Anne-Wil Harzing wrote the Publish or Perish software to help academics document and discuss the impact of their research. You can get a citation report from all the major citation sources (some you need subscriptions for). She's got extensive documentation, including print books.

Good post with more tools: https://medium.com/@aarontay/top-new-tools-for-researchers-worth-looking-at-9d7d494761b0

I mentioned doing a workshop covering some advanced uses of Zotero that included using searchable codes for organizing writing projects, documenting search strategies for formal lit reviews, and using the results list importer and the multi-format exporter for citation analysis projects. Here is a summary of my notes.

using searchable codes for organizing writing projects – this can be used in any citation manager that has a note or tag field, but it works particularly well in Zotero because the tags feature is separate from other keywords. I learned this tip from a PhD student who was using Endnote, back in the day.

Come up with codes for each section, argument, or other organizational feature of whatever you are writing. They should be unique to the project (i.e. not “Intro”) because once you've tagged citations with them you will be able to search for the tag and pull up everything you meant to use in that section.

The key here is that Zotero is searchable, and it's searchable using ANYTHING that you put into it. Use that.

I've never matched my friend's very elaborate coding system for his dissertation, but the idea has been handy for projects where I'm juggling a lot of citations.

documenting search strategies for formal lit reviews – in a formal lit review (systematic, integrative, etc.) you generally want to keep track of what you found using what search, but you also need to be able to eliminate duplicates and not get overwhelmed. Zotero to the rescue. In the process I worked out with a couple of researchers, a new folder is created in Zotero for each search. The results (yes, all the results) are exported from whatever database or search engine is being used – set the results pages to display the greatest possible number of results per page and hit the web importer, going page by page.* Then create a new folder and do the next search, etc. Each folder holds the exact number of results for its search.

Once all the searching is done, run the de-dup tool. De-dup merges entries, but keeps the merged entry in the original folders. At this point, create a Review folder and copy all the entries into the folder to run any additional review/selection criteria against. After this, you don't touch the search folders. They are your archive. The ones that make it through the additional selection get put into yet another folder, ready for the final analysis (whatever that is based on the purpose of the review.) This step can be done all at once or step by step if the selection criteria are more complicated. Once each step is finished and the results are in the folder for the next step, that previous folder is left as an archive. Never remove anything from a folder, just create a new folder for the selected entries.

If the review is being done by more than one person, separate folders can be created for each person's selection criteria phase or they can be done by consensus (tagging can work for this.)

If additional searching turns out to be necessary, you can add more search folders, run the de-dup again, and add the new results to the review folder. If the selection criteria need to be adjusted, the original review folder is still there, untouched, ready to be reanalysed.

There is a trail of exactly what was done, what the results were, and Zotero tracks when things were added to the library, so there is even some chronological tracking.

using the results list importer and the multi-format exporter for citation analysis projects – Citation analysis projects are looking for trends within a collection of citations. Those might be the results of searches, or the references used in a particular body of literature, or... One of the early projects I helped with of this sort was an investigation into the use of certain words within a field of study. Once the search was finalized and tested (did we want titles, abstracts, anything else? etc.) all those results were imported into Zotero.

Zotero is useful for this because not only is it relatively easy to import large numbers of citations, but it's also easy to export them into analyzable formats like .csv files. Obviously, if you don't need to clean up your citations, and your chosen analysis program can read bibtex or whatever format you've already got, you don't need Zotero here.

First – getting things in. As mentioned above, results pages from searches can be imported relatively easily by setting the number of results per page to the highest available and using the web importer ('Select All'). (* again) You can also be more selective and collect citations within the database (assuming that you can mark records and then export your marked list.)

Sometimes, it's not a search, but an existing collection of citations, like reference lists. If those are available in a machine readable format, like RIS or bibtex, everything is easy. But often it's text. I just started using AnyStyle.io and it works quite well. Export the results in bibtex and Zotero adds them with no problem. If needed, some clean up can be done in Zotero, or citations can be filled in by searching in standard databases, importing the results, and de-duping. (I was using the review feature in Mendeley to do this, but I found that because Mendeley is checking against its own database of user citations, there is no guarantee that the citation is going to be any better than the one I started with, and sometimes I got the wrong citation entirely.)

Second – getting things out. This is really easy with Zotero. If your library is entirely one project, just Export the Library in whatever format is most useful for your analysis. Probably that's CSV, but there are many options. If you are only exporting a portion of your library, it's almost as easy. Select the citations for export, right click (CTRL-click for Mac) and choose Export. Once you have your CSV (or whatever) file, you can import it into your chosen analysis program, whether that's Excel, R, Python, or whatever, and see what trends your citations reveal.
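As a sketch of that last step, here's what the analysis end can look like in Python with pandas. The column names (“Publication Year,” “Title”) match Zotero's CSV export as I've seen it, but check your own file's header row; the file name is a placeholder:

```python
# Sketch: read a Zotero CSV export and look for simple trends.
# Column names match Zotero's CSV export as I've seen it; verify
# against your own file. Assumes pandas; file name is a placeholder.
import pandas as pd

df = pd.read_csv("zotero-export.csv")

# items per publication year
print(df["Publication Year"].value_counts().sort_index())

# crude word-frequency count over titles (the kind of thing the
# word-usage project described above needed)
words = df["Title"].str.lower().str.split(expand=True).stack()
print(words.value_counts().head(20))
```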

*This generally works better with a good connection and NOT including PDF files. It also encourages good search practice because, yes, we're talking ALL the results. It also may trigger downloading limits in some databases, so you may want to check with the vendor before doing a really big project, especially if you are pulling full text and not just citations.