Core Concepts: Datasets, Tools & Boxes

QuickCode is about datasets. The simplest dataset is just an SQLite database, shared with a bunch of tools.

Tools act on datasets. Tools are things that populate, update, visualise or export a dataset.

Any tools ‘installed’ into a dataset have access to the shared SQLite database. They can also save functional data into their own private databases, or their local filesystem.

Tools live in boxes – essentially Unix shell accounts on the web. Tools are installed into boxes, and run independently inside those boxes, talking to each other via the scraperwiki.js library or the underlying box API endpoints.


Structure of a tool

Tools are, essentially, files in a git repository which are built to run automatically when installed into a new box. By way of example, here are the contents of the Spreadsheet upload tool:

code/
 └ extract.py
http/
 ├ done.html
 ├ index.html
 └ style.css
test/
 ├ simple.py
 └ tricky.py
README.md
scraperwiki.json

When someone activates the Upload Spreadsheet tool, QuickCode creates a new box, symlinks a global checkout of the Git repository into the box’s /home/tool/ directory, and shows the tool’s http/index.html file to the user in an iframe.

This index.html file contains JavaScript that reads settings and generates a user interface for selecting a spreadsheet to upload. It uses the scraperwiki.js library to select data and run the server-side extract.py script.
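
As a rough sketch of that flow (the button and script path here are illustrative, not the real tool’s), the view’s JavaScript might look like this:

// Sketch only: run the tool's server-side extract script from the view.
// Assumes jQuery and scraperwiki.js are loaded, the tool's code is at
// /home/tool/code/extract.py, and '#upload-button' is a hypothetical
// element in index.html.
$('#upload-button').on('click', function() {
  scraperwiki.exec('cd; tool/code/extract.py', function(data, textStatus, jqXHR) {
    console.log('extract.py finished:', data)
  }, function(jqXHR, textStatus, errorThrown) {
    scraperwiki.alert('Could not extract spreadsheet', jqXHR.responseText, 1)
  })
})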

The scraperwiki.json file contains the following settings describing the tool:

{
  "displayName": "Upload a Spreadsheet",
  "description": "Upload an Excel file or CSV",
  "icon": "https://s3-eu-west-1.amazonaws.com/sw-icons/tool-icon-spreadsheet-upload.png",
  "color": "#029745"
}

A README.md file is included in the Git repo so collaborators on GitHub know what the tool does. This file is not read by QuickCode. You might want to put technical configuration instructions in here.

The test/ directory contains Python unit tests used during development. We suggest you write tests for your tools.


Tables and grids

QuickCode is about data. Most of that data is rigidly structured, often stored in SQL tables.

But sometimes you have to deal with data that has no structure – like an array of cells in a spreadsheet. There may be structured data buried inside it, but right now it’s a bag of cells. We have a term for these bags of cells: grids.

We have a standard for handling these unstructured grids of cells. If you follow it, your grids will be viewable using the “View in a table” tool and downloadable using the “Download as spreadsheet” tool.

A grid is stored on disk as an HTML file. It must contain a <meta charset="…"> tag, and a single <table> element. For example:

<meta charset="utf-8">
<table>
  <tr><td>Eggs</td><td>Milk</td><td>Cheese</td></tr>
  <tr><td>Newspaper</td><td></td><td></td></tr>
</table>

Your HTML markup should conform to the HTML 5 spec.

To make your grids discoverable, create a _grids table in your SQLite database, eg:

CREATE TABLE "_grids" ("url" NOT NULL, "checksum" PRIMARY KEY, "title", "number" INT, "source_url", "source_name", "source_count")

The columns should be used as follows:

url – Required. URL of the grid HTML file.
checksum – Required. 32-character hexadecimal MD5 digest of the grid content (useful as a unique identifier for the grid).
title – Grid title, such as the name of the spreadsheet tab it came from, or the caption it had in the source document.
number – 1-indexed number of this grid, out of all the grids in the source document (useful for ordering grids).
source_url – URL of the source document from which the grid was extracted.
source_name – Name of the source, such as the filename and extension.
source_count – Total number of grids in the source document.
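
For example, a view could list a dataset’s grids by querying this table with the scraperwiki.sql() helper described later on. A minimal sketch, assuming the _grids table above has been populated:

scraperwiki.sql(
  'SELECT title, url, number, source_count FROM _grids ORDER BY source_url, number',
  function(grids) {
    grids.forEach(function(grid) {
      console.log('Grid ' + grid.number + ' of ' + grid.source_count + ': ' + grid.title)
    })
  },
  function(jqXHR, textStatus, errorThrown) {
    console.log('Could not read _grids:', errorThrown)
  }
)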

How to use scraperwiki.js

When you’re writing your own tools, you’ll want to use the scraperwiki.js library to interact with your tool’s underlying box and the main dataset itself.

Scraperwiki.js requires jQuery, so include it in your view’s HTML <head> after you load jQuery:

<script src="//ajax.googleapis.com/ajax/libs/jquery/1.9.0/jquery.min.js"></script>
<script src="//scraperwiki.com/js/scraperwiki.js"></script>

Scraperwiki.js function reference

Read settings

When somebody loads your view, QuickCode passes it a number of settings. The settings are in a URL-encoded JSON object after the # in the URL. You can use this helper function to access them more easily:

scraperwiki.readSettings()

This will return an object where source contains information about your tool’s box, and target contains information about the parent dataset’s box.

{
  "source": {
    "apikey": "<your-apikey>"
    "publishToken": "t5odv7of5l",
    "box": "fdtlza1",
    "url": "https://free.scraperwiki.com/fdtlza1/t5odv7of5l"
  },
  "target": {
    "publishToken": "a1pax8jk32",
    "box": "de3jikz",
    "url": "https://free.scraperwiki.com/de3jikz/a1pax8jk32"
  }
}

For example, this gets the target dataset’s SQL data endpoint:

scraperwiki.readSettings().target.url + "/sql"

Read your tool’s box name

The name (random letters and numbers) of the current box is in scraperwiki.box.

Read the browser’s URL

Sometimes you will need the URL of the outer page, for example to read query parameters or to get a URL for OAuth to redirect back to. Getting the URL happens asynchronously via cross-domain messaging (XDM), so you need to pass in a callback function.

scraperwiki.tool.getURL(function(url) {
  console.log(url)
})

Get user details

You can get information about the person using your tool with:

scraperwiki.user.profile(function(userDetails){
  console.log('your full name is', userDetails.real.displayName)
  console.log('your username is', userDetails.real.shortName)
  console.log('you are in the', userDetails.effective.displayName, 'data hub which has the short name', userDetails.effective.shortName)
})

Redirect the browser

Since you’re in a secure iframe, you need to call a special function to redirect the browser to another location. For example, this redirects to a dataset’s default page:

var datasetUrl = "/dataset/" + scraperwiki.box
scraperwiki.tool.redirect(datasetUrl)

It’s also useful for redirecting to OAuth endpoints.
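
For example, here’s a sketch of kicking off an OAuth flow. The provider URL and client ID are placeholders, not a real API:

// Sketch: send the user to a hypothetical OAuth provider, using the
// outer page's URL as the address to return to afterwards.
scraperwiki.tool.getURL(function(url) {
  var authorizeUrl = 'https://example.com/oauth/authorize' +
    '?client_id=YOUR_CLIENT_ID' +
    '&redirect_uri=' + encodeURIComponent(url)
  scraperwiki.tool.redirect(authorizeUrl)
})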

Rename your tool or its parent dataset

By default, datasets or views created by a tool adopt the name of the tool itself. You can change the name of datasets, or of views made using the "Code a view" tool, by calling scraperwiki.tool.rename(). It takes one argument: a string to which the dataset or view should be renamed. For example:

var username = "@scraperwiki"
scraperwiki.tool.rename(username + "'s twitter followers")

Display alerts

scraperwiki.alert() takes three arguments and displays an alert bar at the top of your dataset / view:

title – A string to be shown in bold at the start of the alert, eg: "Could not authenticate with Twitter".
message – A string to be shown after the title, usually to explain the alert or give the user some help, eg: "Please double-check your username and password".
level – [optional] A boolean controlling the appearance of the alert. Truthy values cause the alert to be shown in red (.alert-error); falsy values cause it to be shown in the default yellow.

Example usage:

scraperwiki.alert("Could not authenticate with Twitter", "Please check your username and password", 1)

SQL API

Use scraperwiki.sql() to read data from the main dataset your tool is attached to.

The scraperwiki.sql() function takes three arguments:

command – The SQL query to execute in the target box.
success(data, textStatus, jqXHR) – A callback to run on success. This is a standard jQuery $.ajax() success callback.
error(jqXHR, textStatus, errorThrown) – A callback to run on error. This is a standard jQuery $.ajax() error callback.

Example usage:

scraperwiki.sql("select * from tweets", function(data, textStatus, jqXHR) {
    console.log('Great! Here are your tweets:', data)
}, function(jqXHR, textStatus, errorThrown){
    console.log('Oh no! Error:', jqXHR.responseText, textStatus, errorThrown)
})

Make sure to use ANSI-standard SQL. Although all QuickCode datasets are currently based on SQLite, it’s likely we’ll allow PostgreSQL databases at some point, so you should avoid database-specific SQL. For example:

  • Use " to escape column and table names (more info).
    Good: SELECT id, "first name" FROM "table name"
    Bad: SELECT id, `first name` FROM `table name`
  • Drop tables rather than deleting the SQLite file.
    Good: DROP TABLE "dirty_table"
    Bad: rm -f scraperwiki.sqlite
  • Use the Metadata API to discover table and column names.
    Good: scraperwiki.sql.meta()
    Bad: SELECT * FROM sqlite_master

SQL Metadata API

Use scraperwiki.sql.meta() to get information about the tables and columns in the database.

The scraperwiki.sql.meta() function takes two arguments:

success(data, textStatus, jqXHR) – A callback to run on success. This is a standard jQuery $.ajax() success callback.
error(jqXHR, textStatus, errorThrown) – A callback to run on error. This is a standard jQuery $.ajax() error callback.

Example usage:

scraperwiki.sql.meta(function(data, textStatus, jqXHR) {
    console.log('Great! Here is the database schema:', data)
}, function(jqXHR, textStatus, errorThrown){
    console.log('Oh no! Error:', jqXHR.responseText, textStatus, errorThrown)
})

Exec API

Use scraperwiki.exec() to execute arbitrary Unix commands inside your tool’s box.

The scraperwiki.exec() function takes three arguments:

command – The shell command to execute in the source box.
success(data, textStatus, jqXHR) – A callback to run on success. This is a standard jQuery $.ajax() success callback.
error(jqXHR, textStatus, errorThrown) – A callback to run on error. This is a standard jQuery $.ajax() error callback.

Example usage:

scraperwiki.exec("cd; ./code.py", function(data, textStatus, jqXHR) {
    console.log('Code.py exited successfully:', data)
}, function(jqXHR, textStatus, errorThrown){
    console.log('Oh no! Error:', jqXHR.responseText, textStatus, errorThrown)
})

You’ll often want to escape quotes and special characters before sending them to the Exec endpoint. For this reason, scraperwiki.shellEscape() returns a string with single-quotes escaped (more for convenience than security).
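
For example, a minimal sketch, where '#name' is a hypothetical input in your view and shellEscape() is assumed to return a value safe to splice into a command:

// Sketch: escape user input before embedding it in an exec command.
// This is a convenience, not a security boundary.
var name = scraperwiki.shellEscape($('#name').val())
scraperwiki.exec('cd; echo ' + name + ' >> names.txt', function(data) {
  console.log('Saved:', data)
}, function(jqXHR, textStatus, errorThrown) {
  console.log('Error:', jqXHR.responseText)
})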

Push SQL to other tools

scraperwiki.tool.pushSQL() takes two arguments: a query and a tool name. When called, the specified tool is installed and passed the specified SQL query. The installed tool can access the query by reading scraperwiki.readSettings().sqlQuery.

query – An SQL query to pass to the tool.
toolName – The name of the tool to be installed.
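
For example, this sketch hands a query to a hypothetical charting tool:

// Sketch: install another tool and pass it an SQL query.
// "plot-a-graph" is a made-up tool name; "deals" a made-up table.
scraperwiki.tool.pushSQL('SELECT date, price FROM deals', 'plot-a-graph')

// Inside the installed tool, the query is then available as:
// scraperwiki.readSettings().sqlQuery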

Styling

It’s important that tools share the same styling and fit in with the rest of QuickCode. If you’re writing your own tool, you’ll want to include Bootstrap and our custom style sheet:

<link rel="stylesheet" href="//scraperwiki.com/vendor/style/bootstrap.min.css">
<link rel="stylesheet" href="//scraperwiki.com/style/scraperwiki.css">
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
<script src="//scraperwiki.com/vendor/js/bootstrap.min.js"></script>

Write your HTML in a way that works well with Bootstrap.

You can use the QuickCode stylesheet preview page to see how standard Bootstrap styles are rendered on QuickCode.

You might also want to read the Zarino Interface Guidelines (ZIG), which lay down a few conventions that QuickCode tools should follow.


Developing your tools

You can see and edit the code of any tool you’ve installed on QuickCode – whether you originally wrote it or not!

Because tools are simply git repositories in a box, and boxes are just shell accounts, there are a number of ways you can develop your tools. They are built to fit around your usual development process.

Develop locally

The simplest way to develop locally is to use a command line utility called swbox. It’s essentially a wrapper around rsync that makes it easy to create a local copy of a tool, and then push local changes back up to the box on demand. Like so:

swbox clone c7bc6s5
cd c7bc6s5
vim tool/http/index.html # make some changes to a file
swbox push

If you want to use version control, you can then git commit and git push inside your local copy as you would normally. Download swbox and read its documentation to get started.

If swbox is too complicated, you can always just use rsync (or scp):

# turn ~/local-directory/ into a copy of the box /home/ directory
rsync -avx --itemize-changes c7bc6s5@box.scraperwiki.com:. ~/local-directory/
# push changes back up from ~/local-directory to the box
rsync -avx --itemize-changes ~/local-directory/ c7bc6s5@box.scraperwiki.com:.

If you prefer a point-and-click solution, try accessing your box over SFTP with a client like FileZilla, Cyberduck or WinSCP:

Protocol: SFTP (not FTP)
Server: free.scraperwiki.com
Username: Your box_name (eg: c7bc6s5)
Port: 22

Develop on the box

Alternatively, you can SSH into the box, edit the code there, and commit it. Git needs to know who you are. To automatically tell it for every box, add this to your local ~/.ssh/config.

SendEnv LANG LC_* EDITOR GIT_*

And add this to your local ~/.bashrc or equivalent.

export GIT_AUTHOR_EMAIL=example@example.org
export GIT_AUTHOR_NAME='A. N. Example'
export GIT_COMMITTER_EMAIL=$GIT_AUTHOR_EMAIL
export GIT_COMMITTER_NAME=$GIT_AUTHOR_NAME

SSH will then pass those environment variables through, and Git will use them to record when you make commits inside the box.

Top tip! You can pass any environment variables you like to your box using SendEnv. For example, use EDITOR to get the text editor that you’re used to. Make sure you export the variable.

For less important code, just use version control locally inside the box. For more important code, add a remote and push/pull to it. Full instructions are in GitHub’s help on Creating a repository. In short, do this:

git remote add origin https://github.com/username/teach-example-tool.git
git push origin master

You’ll find you need to keep typing in your GitHub password, or set up a new SSH private key for each box. Agent forwarding is a much better and easier solution. In short, set up a local SSH agent, and add this to ~/.ssh/config:

Host box
    HostName free.scraperwiki.com
    ForwardAgent yes

Full instructions are in GitHub’s help on Using SSH agent forwarding.

Top tip! As a bonus, you can also then simply do ssh <box_name>@box to connect to any box.


Internal box settings

A box is a Unix user account on QuickCode’s server cluster. The Unix user account has the same name as your box (eg: by227hi) and exists inside a chroot jail for security and privacy. Your home directory is always /home/.

Because boxes are just Unix user accounts, all your favourite Unix tools like scp, git, and cron work right out of the box. You have a permanent POSIX filesystem.

At any one time your box is allocated to one particular server, but that server might change from time to time. Occasionally we may need to migrate your box to a new server (when QuickCode has to create or destroy servers). In that case we may kill running processes on your box. Any new processes that your box creates (via the exec endpoint or cron) will run on the new server, so ps will still list all of your box’s processes.


Unix services

Standard Unix tools included in every box:

  • Languages such as Python, R, Ruby, PHP, Node, Java and Clojure.
  • Scraping libraries such as Mechanize, Hpricot, Highrise, Zombie.
  • Data-handling libraries such as NLTK and iCalendar.
  • Version control software such as git, Subversion and Mercurial.
  • Useful tools like GNU Screen, strace and curl.
  • Editors like vim and Emacs.

You can install anything else you need in the box. Use a language-specific package manager (e.g. pip, gem) or download the source and compile it. For example, to get a Python package, do this:

pip install --user <packagename>

Schedule your code to run using Cron:

You can create a standard cron job using the crontab command. The MAILTO variable works as normal, and is a good way to get cron output via email.

Cron jobs run on the same server as the rest of your box; as noted above the particular server may change from time to time but at any moment all of your box's processes will be running on the same server.

For technical reasons we encourage you not to schedule your cron jobs for specific times (such as 10 minutes past the hour). Use @daily and @hourly instead. We may not honour crontabs that run jobs at specific times of day.
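
For example, here’s a sketch of a crontab, where the address and script path are placeholders:

# Cron output is emailed to this address (MAILTO works as normal)
MAILTO=you@example.org
# Run the tool's scraper once a day; cron jobs start in your home directory
@daily tool/code/extract.py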

Send email (not spam!) using SMTP:

You can send email using SMTP via port 25 on localhost. This is intended for logging and alerting. Spam is against our Terms & Conditions, and will not work.

Access your code using SSH, SCP, SFTP, rsync and Git:

Your box is an ordinary SSH server, meaning you can use scp, sftp, git over SSH and so on. Your keys are stored separately for each box in /home/.ssh/, so you can add and remove people.


Box API reference

Watch out for 403s! We currently only supply an Access-Control-Allow-Origin header in our API responses, and no other CORS headers, so avoid AJAX requests that trigger a CORS preflight.

HTTP fileserver

Files placed in the ~/http/ directory of your box will be served statically via the box’s HTTP endpoint. So, a file at ~/http/index.html will be accessible at https://free.scraperwiki.com/<box_name>/<publish_token>/http/index.html

The HTTP file endpoint is a great way to serve static web pages, client-side javascript apps, and downloadable files – especially when you’re writing views, or setup screens for more complex datasets.

The file’s MIME type is worked out from its extension, using a standard set of rules supplied with nginx.
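
For instance, a view can build the public URL of a file it has saved into its own http/ directory. A sketch, where output.csv and '#download-link' are hypothetical:

// Sketch: link to a file this tool has written to ~/http/output.csv.
var settings = scraperwiki.readSettings()
var downloadUrl = settings.source.url + '/http/output.csv'
$('#download-link').attr('href', downloadUrl)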

CGI endpoint

All boxes have a CGI endpoint. Any file that is executable and is in the /home/cgi-bin directory will be made available to execute as a CGI script (usually these will be Python CGI scripts or similar, but you can write them in whatever language you like, and they can be compiled executables).

The URL will be https://free.scraperwiki.com/<box_name>/<publish_token>/cgi-bin/your-executable.

Just like any other CGI provider, you can pass query parameters (after the ? in the URL). The CGI script will be executed with the full privileges of the box shell account, and can do anything you could do by SSH'ing into the box.

Symbolic links will be followed, meaning that you can have a Python script called query.py and a symlink to it called query, which will lead to slightly nicer URLs.

You must print a `Content-Type:` header, any other headers you need, and then a blank line. The Python `cgi` module documentation will prove useful if you’re writing a Python CGI script.

CGI-bin responses default to an HTTP 200 status. If you want to return a different status, you can print your own `Status:` header.

In general, your scripts will look like:

#!/usr/bin/env python

import cgitb
import cgi

# Optional:
# enable printing of tracebacks
# when something goes wrong
cgitb.enable()

# Required:
# Set a content type
print "Content-Type: text/plain"

# Optional:
# Set a non 200 HTTP status code
print "Status: 418 I'm a teapot"

# Required:
# Print an empty new line,
# to tell the browser that headers
# have ended, and body is beginning
print

print "Hello world"

# Optional:
# Get parameters from query string
data = cgi.FieldStorage()
print data['some_key'].value

CGI requests will be limited in the same way that requests to the exec endpoint are: Only a few in-flight requests are allowed.

SQL API

If there is an SQLite file ~/scraperwiki.sqlite, you can query it using the read-only SQL endpoint like so: https://free.scraperwiki.com/<box_name>/<publish_token>/sql?q=select+*+from+sqlite_master.

The SQL endpoint accepts only HTTP GET requests with the following parameters:

q – The SQL query to execute. Multiple queries separated by a ; are not allowed.
callback – [optional] A callback function with which to wrap the JSON response, for JSONP output.

The SQL endpoint returns a JSON list of objects: one object for each row in the result set. Although JSON object keys officially have no order, the keys in these objects are returned in the same order as the columns in the SQL database.
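
Here’s a sketch of calling the endpoint directly with jQuery. The boxes’ URLs come from readSettings() as described earlier; the "deals" table is hypothetical:

// Sketch: query the target dataset's raw SQL endpoint with a GET request.
var target = scraperwiki.readSettings().target
$.ajax({
  url: target.url + '/sql',
  data: { q: 'SELECT * FROM deals LIMIT 10' }
}).done(function(rows) {
  // rows is a JSON list of objects, one per result row
  rows.forEach(function(row) { console.log(row) })
}).fail(function(jqXHR) {
  console.log('Query failed:', jqXHR.responseText)
})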

SQL Metadata API

You can see metadata about your database by making an HTTP GET request to https://free.scraperwiki.com/<box_name>/<publish_token>/sql/meta. The meta endpoint returns an object like this:

{
  "databaseType": "sqlite3",
  "table": {
    "deals": {
      "columnNames": [ "deal_id", "ref_no", "deal_name", "status", "created", "updated", "price" ],
      "type": "table"
    }
  }
}

Exec API

You can execute commands remotely, without SSHing in, by using your box’s /exec endpoint. The Exec endpoint accepts HTTP POST requests with two required body parameters:

cmd – The Unix command to execute inside the box. Multiple commands separated by a ; are allowed. Commands are run from /.
apikey – The API Key of the box owner.

Watch out! The Exec endpoint allows potentially destructive access to your box. Never share your API Key with anyone.

Because the Exec endpoint is secured using your apikey, there is no need to provide a publish_token in the URL. Eg:

$.ajax({
  url: 'https://free.scraperwiki.com/example/exec', // note: no publish_token
  type: 'POST',
  data: {
    'cmd': 'echo "hello world" > hello.txt; ls -la',
    'apikey': '<your-apikey>'
  }
}).done(function(text){
  console.log(text)
})

Unlike the other box endpoints, the Exec endpoint returns plain text, rather than JSON.

File upload API

Boxes come with a file upload endpoint, allowing you to write datasets or views that accept a user’s files as input. The file upload endpoint accepts HTTP POST requests, and like the Exec endpoint, requires your apikey as a body parameter:

file – The file you wish to upload.
apikey – The API Key of the box owner.
next – The URL to which the user is redirected once the file has been uploaded. A filePath value containing the uploaded file’s name is added to the settings hash, accessible via scraperwiki.readSettings().

You will often use the file upload endpoint as the action attribute of a web form, like so:

<!-- in /http/index.html -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
<script src="https://scraperwiki.com/js/scraperwiki.js"></script>
<form id="up" action="../../file/" method="POST" enctype="multipart/form-data">
  <input type="file" name="file" size="80" id="file">
  <input type="hidden" name="apikey" id="apikey">
  <input type="hidden" name="next" id="next">
  <input type="submit" value="Upload now!">
</form>
<script>
  settings = scraperwiki.readSettings()
  $('#next').val(window.location.pathname + 'done.html' + window.location.hash)
  $('#apikey').val(settings.source.apikey)
</script>

The uploaded file will be put in the /home/incoming/ directory.

Dataset status API

QuickCode can be informed of the status of your datasets (for example, when they last ran, and whether they encountered errors). You can register the status of your dataset by making an HTTP POST request, from within the box, to https://app.quickcode.io/api/status. The dataset status will be shown on the QuickCode website, in users' dataset lists.

type – Your dataset’s status. Should be either ok or error.
message – [optional] A description, such as Scraped 24 new tweets or Invalid password. If not specified, the QuickCode website instead shows "Refreshed" or "Error" followed by how long ago the status was updated.

When you POST to the status API, the tools related to that dataset will have their "update" hook executed. This hook lives in the tool/hooks directory and must be executable, e.g. chmod +x tool/hooks/update.

This API requires no publish_token or apikey because it automatically detects the credentials of the box from which it’s called. Magic!

The endpoint returns an object with a single success key on success, or an error key on errors.

Here’s some example Python, using the scraperwiki library to call the endpoint:

scraperwiki.status('ok')
scraperwiki.status('error', 'Source website broken')