Computer Science: Data Mining & Visualization

Guide content supports the teaching and research goals of multiple departments on campus. Content represents a non-exhaustive selection of essential resources and tools for engaging a wide range of backgrounds and viewpoints. 

Text and data mining (TDM) are becoming increasingly popular ways to conduct research. They entail using automated tools to process large volumes of digital content to identify and select relevant information and discover previously unknown patterns or connections. Text mining extracts information from natural language (textual) sources. Data mining extracts information from structured databases of facts. The extracted information is assembled to reveal new facts or to formulate hypotheses that can be further explored using conventional methods. TDM is useful in many disciplines, from the humanities, where it is used by digital humanities scholars, to the sciences, where useful data can be mined from large non-text datasets and textual databases of published literature.

Terms & Definitions
Term	Definition
Application Programming Interface (API)	A defined programming interface through which users and software can request and obtain large quantities of information (text/data/objects) in a machine-readable format.
Corpus	A collection of documents such as webpages or journal articles.
Crawling	A method that automatically finds links within a website and "scrapes" the information from them (see scraping) so that it can then be "cleaned up" and made machine-readable.
Document Type Definition	A set of declarations that defines the permitted structure of a document written in a mark-up language such as SGML or XML (which elements and attributes may appear, and in what order), so that computers can validate the document and understand how it should be interpreted.
Entity	A real-world thing referred to in a text (e.g. a named person, place, or organization).
Extensible Mark-up Language	A web standard for document mark up, designed to simplify and provide flexibility to Web and other digital media authorship and design. Unlike HTML, it is not a fixed format language.
Hypertext Mark-up Language	A text-based coding language interpreted by web browsers and used to construct web pages.
Information Extraction	Automatically isolating specific data (e.g. identity) from unstructured text.
Lemma & Lexeme	A lexeme is a unit of meaning that can take several inflected forms and can span multiple words; a lemma is the canonical (dictionary) form chosen to represent it. For example, in English, read, reads, and reading are forms of the same lexeme, whose lemma is read.
Machine Learning	A mathematical or statistical algorithm that automatically identifies (learns) patterns in data.
Natural Language Processing	Software or services facilitating the automatic analysis of text.
Ontology	The organization of a specific domain with the entities that belong in it and their relationships.
Web Ontology Language (OWL)	A W3C language for representing relationships between entities in a way that computers can process them.
Parsing	(Linguistic) parsing refers to the process of (syntactic) analysis of text and breaking down a sentence into its component parts (in machine terms, a file can be "parsed" into its component parts).
Relationship Extraction	The process of automatically finding "semantic relationships" between two (or more) entities.
Scraping	The process of identifying, copying, and pasting information into files that can be later "cleaned up" or made machine-readable.
Semantic Relationship	A linguistic relationship between two or more entities expressed in a way that can be understood by a computer.
Sentiment Analysis	The automatic identification of the opinions, attitudes, or emotions (for example positive, negative, or neutral) expressed in a text.
Standard Generalized Mark-up Language	An international standard (ISO 8879) for defining mark-up languages. HTML was originally defined as an application of SGML, and XML is a simplified subset of it.
Stop List (or stoplist)	A set of words automatically omitted from a computer search, concordance, or index because they slow down processing of text or produce false results.
Taxonomy	A specific vocabulary that expresses relationships and organizes information in a hierarchical or linear manner.
Text and Data Mining	The extraction of information from natural language works (books or articles, for example) or numeric data (e.g. files or reports) using software that reads and digests digital information to identify relationships and patterns far more quickly than a human can.
Token	A token is a single occurrence of a unit of text, usually a word, number, or punctuation mark. Token counts are used to measure lexical density (the ratio of lexical, or content, words to the total number of tokens). In terms of writing, lexical density measures how informative a text is. Tokenization is the process of splitting text into tokens.
Treebank	This is a corpus of syntactically parsed documents used to train TDM models.
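
Several of the terms above (token, tokenization, stop list, lexical density) can be illustrated with a short Python sketch. It is a minimal illustration rather than a recommended method: the tokenization rule is deliberately simple, the stop list is a tiny hypothetical one invented for this example, and the ratio of non-stop-list tokens to all tokens is only a rough proxy for lexical density.

```python
import re
from collections import Counter

# Hypothetical, very small stop list for illustration only;
# real stop lists contain many more words.
STOP_LIST = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Split text into lowercase word tokens using a simple pattern."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def remove_stop_words(tokens, stop_list=STOP_LIST):
    """Drop every token that appears on the stop list."""
    return [t for t in tokens if t not in stop_list]


def content_word_ratio(tokens, stop_list=STOP_LIST):
    """Share of tokens not on the stop list (a rough lexical-density proxy)."""
    if not tokens:
        return 0.0
    return len(remove_stop_words(tokens, stop_list)) / len(tokens)


sample = "The archive is a corpus of texts, and the texts are mined."
tokens = tokenize(sample)
content = remove_stop_words(tokens)
frequencies = Counter(content)

print(tokens)
print(content)
print(frequencies.most_common(1))
print(round(content_word_ratio(tokens), 2))
```

Dedicated natural language processing libraries handle punctuation, contractions, and languages other than English far more carefully; the sketch only shows how the terms relate to one another.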

Duke Library's Introduction to Text Analysis
This guide is a great resource that provides information for working with text: cleaning, parsing, methods, and analysis/visualization tools.
How does text mining work?
Understand the basics of how text and data mining works and how it is used to help advance science and medicine.
What is text mining?
An introduction to the basics of text and data mining.

Since the mid-1980s, advances in technology have propelled text and data mining to prominence across disciplines. Increased interest in the field has surfaced multiple issues, such as copyright, fair use, and commercial viability. For example, flexible copyright laws in the US, Israel, Taiwan, and other countries allow TDM to be deemed transformative and, thus, lawful under fair use (see Authors Guild v. Google, for example). Here, we introduce a few links to sources that will shed light on major issues in TDM:

Carnegie Mellon University and Georgetown University have created Six Degrees of Francis Bacon, a groundbreaking digital humanities project that recreates the British early modern social network to trace the personal relationships among figures like Bacon, Shakespeare, Isaac Newton and many others.

From Culturomics, Bookworm is a simple and powerful way to visualize trends in repositories of digitized texts. Users must register for an account to create their own "bookworm."

Another example that uses early modern texts is Northeastern University's Women Writers Project. This project allows researchers to study texts by pre-Victorian women writers in a new way and to uncover relationships among them that would be difficult to find through close reading alone.

There are also projects that analyze visual data. For example, Yale's Robots Reading Vogue analyzes text and visual images to explore questions of gender studies and other perspectives.

Springer APIs

Springer Nature's APIs are available at https://dev.springernature.com/, where you can also sign up for an account and API key to begin querying the meta and openaccess APIs.
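
As a rough sketch of what querying these APIs might look like, the Python below only builds a request URL and does not contact any server. The base URL, the endpoint paths, and the parameter names (q, p, api_key) are assumptions based on commonly published examples and are not confirmed by this guide; check them against the documentation at https://dev.springernature.com/ before use. The function name build_query_url and the key YOUR_KEY are placeholders invented for this example.

```python
from urllib.parse import urlencode

# Assumed base URL and endpoint paths; verify against the official docs.
BASE_URL = "https://api.springernature.com"
ENDPOINTS = {
    "meta": "/meta/v2/json",          # metadata records (assumed path)
    "openaccess": "/openaccess/json",  # open access content (assumed path)
}


def build_query_url(api, query, api_key, page_size=10):
    """Return a request URL for the chosen API (hypothetical helper)."""
    if api not in ENDPOINTS:
        raise ValueError(f"unknown API: {api!r}")
    # q = search query, p = results per page (assumed parameter names).
    params = urlencode({"q": query, "p": page_size, "api_key": api_key})
    return f"{BASE_URL}{ENDPOINTS[api]}?{params}"


url = build_query_url("meta", "text mining", "YOUR_KEY", page_size=5)
print(url)
```

The resulting URL could then be fetched with a standard HTTP client (for example urllib.request.urlopen) and the JSON response parsed with the json module, subject to the terms of use attached to your API key.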

arXiv
Collection of electronic pre-prints in various sciences. The archive is based upon activities supported by the U.S. National Science Foundation under agreement with the Los Alamos National Laboratory, and by the U.S. Department of Energy.
BioMed Central
BioMed Central is an independent publishing organization committed to providing immediate and free access to peer reviewed biomedical research.
Public Library of Science
PLOS is a nonprofit, Open Access publisher empowering researchers to accelerate progress in science and medicine by leading a transformation in research communication.
PubMed Central
PubMed Central® (PMC) is a free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM).

English Corpora
ALL FIRST-TIME USERS MUST REGISTER WITH ENGLISH CORPORA, AND THEN SPECIFY “UNIVERSITY OF IOWA” IN THEIR ACCOUNT INFORMATION. Please direct any questions to the Education & Psychology Librarian at kelly-hangauer@uiowa.edu. The English Corpora include some of the most widely used online corpora. This resource allows users to find out how native speakers speak and write; find frequency of words, phrases, and collocates; look at language variation and change; gain insight into culture; and design authentic language teaching materials and resources.
Eighteenth Century Collections Online (ECCO)
A comprehensive digital edition of The Eighteenth Century microfilm set, which has aimed to include every significant English-language and foreign-language title printed in the United Kingdom, along with thousands of important works from the Americas, between 1701 and 1800. Consists of over 180,000 titles of books, pamphlets, broadsides, ephemera. Subject categories include history and geography; fine arts and social sciences; medicine, science, and technology; literature and language; religion and philosophy; law; general reference. Also included are significant collections of women writers of the eighteenth century, collections on the French Revolution, and numerous eighteenth-century editions of the works of Shakespeare. Where they add scholarly value or contain important differences, multiple editions of each individual work are offered.
Internet Archive
The Internet Archive offers over 20,000,000 freely downloadable books and texts. There is also a collection of 2.3 million modern eBooks that may be borrowed by anyone with a free archive.org account.
JSTOR (Journal Storage)
Provides image and full-text online access to back issues of selected scholarly journals in history, economics, political science, philosophy, mathematics and other fields of the humanities and social sciences. Consult the online tables of contents for holdings, as coverage varies for each title. Journals may be searched across multiple titles as well as by individual title.
Note that this database comprises mostly back issues: for most titles the JSTOR database does NOT include full text of the most recent 3 to 5 years.
Project Gutenberg
Project Gutenberg is an online library of free eBooks. Project Gutenberg was the first provider of free electronic books, or eBooks. Michael Hart, founder of Project Gutenberg, invented eBooks in 1971 and his memory continues to inspire the creation of eBooks and related content today.
Oxford Text Archive
The Oxford Text Archive (OTA) provides repository services for literary and linguistic datasets. In that role the OTA collects, catalogues, preserves and distributes high-quality digital resources for research and teaching. The OTA currently holds thousands of texts in more than 25 languages, and is actively working to extend its catalogue of holdings. The OTA relies upon deposits from the wider community as the primary source of materials.
University of Pennsylvania Online Books Page
The Online Books Page is a website that facilitates access to books that are freely readable over the Internet. It also aims to encourage the development of such online books, for the benefit and edification of all.

Chronicling America: Historic American Newspapers
Chronicling America is a Website providing access to information about historic newspapers and select digitized newspaper pages, and is produced by the National Digital Newspaper Program (NDNP). NDNP, a partnership between the National Endowment for the Humanities (NEH) and the Library of Congress (LC), is a long-term effort to develop an Internet-based, searchable database of U.S. newspapers with descriptive information and select digitization of historic pages. Supported by NEH, this rich digital resource will be developed and permanently maintained at the Library of Congress. An NEH award program will fund the contribution of content from, eventually, all U.S. states and territories.
New York Times (ProQuest Historical Newspapers)
Full-text access to the New York Times from 1851 (New York Daily Times) to 2014. Documents display in PDF.