Importing Data

This page describes some of the different ways you can import data into Heurist from external sources. Import methods include a powerful Delimited Text File Importer, for importing database tables and spreadsheet data; Zotero Bibliographic Synchroniser; KML Importer; In-Situ File Indexer (typically for large image collections).

Delimited File Text Importer

This is the best way of importing data from an existing spreadsheet or database table. It allows matching of existing records based on one or more columns, and for data to be split out into new record types linked with a pointer field (e.g. pulling out Authors or Place Names which are repeated for many records in the input data). It handles multiple values in a column, multi-line text columns and is very tolerant of imbalanced quotes and other typical CSV/TSV gotchas.

ImportDelimitedV4

It can be extremely hard to import complex comma-separated or tab-separated data (CSV or TSV files, hereafter ‘delimited text files’), such as those exported by most spreadsheets, into a database. The problem is particularly acute for Humanities disciplines where data often include lots of text fields containing commas, isolated apostrophes and single quotes (often both  ‘  and  `  ), quotations within text (often with distinct start and end quote marks “ ”  or ” ), and multiple lines of text in one field. Delimited text files are also intrinsically limited to recording one type of entity per file, and have no mechanism – other than row references or field matching – to render relationships between different entities. This problem is addressed below. 

Handling Textual Data

Delimited text files were simply not designed to cope with unstructured text, and the choice of common characters as field separators/delimiters (particularly commas) and field enclosing characters (single or double quotes with several variants) exacerbates the problem. Yet delimited text files are often the only data transfer mechanism available to end users.


The following example illustrates a few of the potential problems faced in interpreting a line of CSV data. These conditions can be processed correctly with appropriate software design (comma delimiters, various double and single quotes enclosing fields):
1234,John's House,Known as "The Mill",He said "I will not move"" , “Erith, near St Neot" 
Potential problems: isolated single quote/apostrophe in unquoted field; quotation within unquoted field; double quotes within double quoted field; imbalanced quotes surrounding fields, spaces before/after delimiters.
This example, on the other hand, is a forced error, since the unquoted delimiter within a field creates an extra field; this condition is easily detected due to the one-line-per-record design.

1234,John’s House,Erith, near St Neot


More often than not it will take a lot of prior data cleaning to get delimited text files into a form amenable for import. Typical delimited import functions give little help in finding the errors because they either see line endings within text fields as new records, or they ignore line endings altogether and count delimiters, rapidly losing synch with records as soon as they hit an imbalanced delimiter. In designing Heurist’s delimited import function, we have aimed to solve these problems by addressing the issues encountered through a number of projects where we had to deal with large amounts of poorly structured textual data.

First, Heurist applies a one-record-one-line rule. This may initially seem to fly in the face of multi-line text fields, but these are easily handled by replacing within-field line endings by CR only using a script or a suitable text editor such as NotePad++. By enforcing this rule, it is easy to identify and list individual records with missing fields or imbalanced commas or quote marks as the first import step, allowing rapid cleanup. Line feeds delimiting the end of each record are auto-detected or can be forced to Windows, Unix or Mac for problem cases.

Secondly, Heurist allows for single quotes ( ` or ‘ ) and paired single or double quotes ( ‘ ` “ ”  ” ) within an unquoted text field, while correctly splitting the line into fields. This obviates the need to surround text fields in quotes, provided the delimiter (comma or tab) does not occur within the data (use of tab delimiters is recommended, but commas within fields of a CSV file can be handled by enclosing the field in balanced single or double quotes, even if the field contains other quote marks). To import a delimited text file into Heurist, we recommend using tab delimiters – this avoids any problem with commas in unquoted fields. Errors encountered when the data is parsed by the importer will be reported by record number, along with a dump of the record contents. This is normally sufficient to easily locate the problem data in the source file.

Multi-value Fields

Heurist recognises additional delimiters ( | recommended ) as potential delimiters between several values stored in a single column of the text file. This will result in slightly different behaviours when the column is used as part of a set of key columns to identify records in the database (search/match phase) or imported into a specific field in the data (insert/update phase). If a multi-value column is included in the record matching phase, matching will be carried out on each of the values in turn combined with the other key columns, and multiple record IDs will be inserted in the ID field generated, or multiple New Record markers. If a multi-value column is included in the columns to import in the insert/update phase, the data will be inserted as repeat values  for the field.

Multiple Entity Types with Delimited Text Files

By nature, delimited text files can only handle a single entity type – each row describes one type of ‘thing’ with a fixed set of attributes. Such files may reference other related entities by including the name or identifier of an entity in another file (a ‘foreign key’ in database parlance, or simple (pointer) relationship). The single-value-per-column nature of most delimited files limits the ability to record multiple relationships, let alone types of relationship, a limitation which is sometimes circumvented by using multiple columns (either to allow multiple relationships or to allow multiple kinds of relationships or, horrifyingly, both). Other schemes may involve encoding relationship type (captain, first mate, start and end) and related entity (person, port) in some form of composite field along the lines of:

person:captain:37; person:first mate:23; port:start:12; port:end:15
person:captain:William Bligh; person:first mate: Fletcher Christian; port:start:Spithead; port:end:Tahiti

Apart from the huge potential for errors, such complex relationship data is impossible to process without bespoke programming on a case-by-case basis. Heurist will allow the import of simple relationships (pointers) defined by matchable columns. The data above could be reorganised as:

Ship, Captain, First Mate, Start Port, End Port
 Bounty, William Bligh, Fletcher Christian, Spithead, Tahiti

Provided the names of the people are recorded in a column in a file of people and the names of the ports are recorded in a file of ports, these files can first be imported, then the columns can be matched* against the database to replace the names of the people and ports with the corresponding record identifiers, and these identifiers can be imported as values into Record Pointer fields for Captain and First Mate (pointing to Person records) and Start port and End port (pointing to Port records). *Matching can be carried out using a single composite name “William Bligh” or with separate family name and first name fields, or with numeric identifiers, provided the same form is used in both files. Alternatively, the data could be structured as follows to import the data as full relationship records:

Ship, Person, Role
 Bounty, William Bligh, Captain
 Bounty, Fletcher Christian, First mate
 Bounty, Spithead, Start Port
 Bounty, Tahiti, End Port

The ships can be imported from the first column and replaced with the corresponding record identifiers, the people can be imported from the second column and replaced with their corresponding identifiers, finally the relationships can be imported setting the source record pointer to the first identifier, the target record pointer to the second identifier, and the relationship type to the Role value* *This requires the terms used in the Role column to be entered as relationship terms in the Heurist database. A list of the terms used (excluding any already in the database) can be obtained by running the import to the point where term errors are reported.  Note also that once relationships have been imported in this way, they will show up in the Relationship Marker fields under Ship or Person if appropriate fields have been defined (i.e. fields with the appropriate target record type – Person in the case of a Crew Member field in a Ship record – and the appropriate relationship terms: Captain, First Mate etc.).

Converting within-field line endings

The following Unix script instructions may be useful for pre-processing files with multi-line fields.

  • Add an initial column containing a distinctive fixed string such as “StartOfRecord” – this will be used to identify the breaks between records.
  • Save the following line of perl as filt.pl perl -e ‘undef $/; while (<>) { s/n/\n/g; s/r/\r/g; print $_, “n” }’  (thanks to ‘Zack’ on stackOverflow.com)
  • Run this filter as follows:
    • chmod -a+x filt.pl
    • cat myfile.csv | ./filt.pl > myfiletransformed.csv
    • Edit myfiletransformed.csv with a wordprocessor;
      • note, the file will appear as a single line of text.
      • MS Word handles this fine for 1MB files, some text editors may choke.
      • Change “StartOfRecord,” or “StartOfRecord<tab>” to <newline> (for Word, <tab> = ^t  <new line> = ^p )
      • Save as a TEXT file
Zotero Bibliographic Synchroniser

The Zotero Bibliographic Synchroniser reads a Zotero web library and updates bibliographic records in your Heurist database. This allows the inclusion of bibliographic information from Zotero in searches, visualisation and output from Heurist.

KML Importer

The KML importer will carry out a basic import of a KML file, creating a record for each object in the KML file. It primarily imports geometry and label; it does not do a good job of other attributes. If you need to import a set of geometries (often points) from KML to create individual records, and require better attribute handling, we recommend using a GIS tool (including many free services on the web) to write out a delimited data file (with a WKT  field for the geometries if importing lines or polygons) and then using the delimited file importer to import the data.

Index Multimedia

Primarily intended for rapid ingest of large numbers of multimedia files, the Indexer allows indexing of files on the database server and the creation of multimedia records in the database. Metadata such as title, file size, geographic location and MD5 checksum are automatically extracted from images. The indexer also reads and creates metadata compatible with the FieldHelper metadata editor.

Other Import Methods

With a little understanding of the underlying data tables (Records, recDetails) it is relatively easy to do direct SQL insertions of data into Heurist for bulk data insertion. We are happy to assist with bulk data import on a consultancy basis. Contact us here.

Older Importers

EndNote Bibliographic Importer

There are two methods of importing bibliographies: the Zotero bibliographic Synchroniser (see above) and an Endnote Refer Format Importer. We recommend using Zotero for bibliographies and synchronising your Heurist database(s) with the online version of your library. The Refer importer has not been fully updated to current standards and we cannot guarantee what results you will get – if you need to use it, please contact us first and send a sample bibliography with a variety of the bibliographic types you use.

Old CSV Importer

This importer requires the delimited data to be pasted into a form in Heurist and creates new records. It will not match existing records, but simply creates duplicates. It has largely been superseded by the delimited file importer above.

Old Detail Updater

This importer inserts data for fields (‘details’) in existing records, specified by their record ID and internal field code (which is listed in the record type structure edit form). Like the old CSV importer, data is pasted into a form in Heurist. This function has also been largely superseded by the new importer.