Full Circle Magazine FR

Titre : Store Data Safely: An Intro To SQLite

There are basically two ways in which we can store data on disk. Historically, we had to set up a number of files. It was necessary for each program to issue reads and writes at the file level, in the correct manner, to ensure that the “database” was maintained in a correct state—so this was a kind of “database” but it had to be re-invented for each major application. The database was effectively hard-coded into the software, and a small change to the database could potentially require massive programming effort.

A true database is simply a structured collection of data. There are many varieties of database, but by far the most popular kind uses “Structured Query Language” or SQL as the language used for access and control. There are many implementations of SQL databases, and SQLite is a popular one. The choice of database depends upon the application; if it is networked or could involve many simultaneous update processes, then SQLite would not be recommended. However, at the “low end,” with non-networking and few simultaneous update processes, SQLite has many advantages—such as near-zero configuration, small memory footprint, and still gives a maximum database size of about 2 TeraBytes. It claims to occupy about 350 KBytes of memory. So SQLite is ideal for any PC. It is (perhaps) the most widely used database available—and it just happens to be open source. It may even be in use on your mobile phone…

With our relational database we can demand:- • Complex data structures. • Ease of use. • Simple access from any program language. • Database instructions should be in an English-like language. • Atomicity—if a transaction performs multiple updates to multiple files, it either performs all the updates (if it succeeds) or no updates (if it fails).

Luckily (for us), the relational database was invented some time ago (1970s), together with a simple language—“Structured Query Language” or SQL, with which to control and access it. All versions of SQL are broadly standardised, so we have to learn it only once.

An example of a simple SQL statement (to insert some data into a database):

INSERT INTO table1 VALUES(250,84,'The quick brown fox jumped over the lazy dog');

An SQL command starts with a verb, possibly followed by several qualifiers, and is always terminated with a semicolon. I have followed a common convention of writing SQL keywords in uppercase.

Apart from the zero-configuration aspect of SQLite, there is an additional advantage for the SQL neophyte: Firefox uses SQLite to manage internal caches, bookmarks, history, etc. So Firefox found it useful to write a GUI extension which makes it very easy to prototype and create your own SQLite database.

Virtually every programming language supports SQL—you can even run the program from the command line—although we shall prototype our database the simple way. I will give one or two simple examples of Python usage.

Hopefully, you use Firefox as your browser, as you need to install an extension (SQLite Manager 0.7.7) in order to load a full GUI interface which will allow for prompting, execution and testing of your SQL. Once you like what you see, you can transfer the SQL statements to your favourite programming language, which could be anything from BASH to BASIC.

To add the relevant extension to Firefox:- • From the Firefox menu click on the Tools > add-ons menu. • In the search box enter SQLite Manager; you should see one entry • Click on this and you should see a dialog box asking you to confirm your decision;

You will now need to restart Firefox for this to take effect.

We are now set to go—but we should do a bit of database design before diving in.

An SQL database has a very simple structure: it consists of a number of tables which look very much like spreadsheets; they have a preset number of columns (defined when the table was created), and an unlimited number of rows.

There are effectively no row labels; we use the data itself to relate a row of data in one table to a row of data in another table. There is no limit to the number of tables—although the number of tables rarely gets too large.

This probably raises questions of how can it work efficiently on large databases. It doesn't just plough through masses of data (although this can happen with bad designs). Behind the scenes, SQL maintains a number of indexes, some created automatically and some optionally created by the user. Many database operations are simply operations on these efficient indexes.

It's worth pointing out that a SQLite database is stored as a single large file—which makes backups (very important) a particularly easy task.

The Application

An artificial example:- suppose we have a bunch of students who will each be sitting a number of exams. We need to record the exams taken, the student name and address details (so we know where to post the results to). The accent is very much on simplicity!

I've tried to follow a simple naming convention:- to prefix table names with 'T', column names are in lower-case, and SQL statements are in upper-case.

We shall create four tables: Tstudent (one row per student) Taddress (one row per address line per student) Texam_name (one row for each exam set – a list of exam titles) Texams (one row per student per exam taken)

Note that we could have stored the address in the Tstudent table as a preset number of fields—but how many? There is bound to be someone who requires more lines for their address than we had planned for.

This process whereby a repeating field is moved into a subsidiary table is an example of “normalisation” and is an important feature of database design.

A similar reasoning justifies the creation of the Texam_name as we don't want to be forced to re-enter the full exam title for each student—we can just enter an exam identifier.

First we need to create our database, so it's time to launch Firefox, and, from the Firefox menu, click on Tools > SQLite Manager You will be presented with a sparse looking screen, so choose Database from the menu and click on the new Database entry:- We are going to call our database the “Exam” database, so enter Exam into the dialog box—the system will add .sqlite onto the end, and ask you to choose a folder to put it in—I suggest you choose your home directory.

This gives us a rather complicated looking initial screen (shown above).

The reason for the rather complicated screen is that SQLite builds its own database to hold metadata about our tables and any relationships between them. This is maintained internally by SQLite and we have relatively little access to it.

So it's now time to create our first table. In practice, it's a good idea to set out your tables on paper before entering them into the computer as there is rarely an absolutely “correct” way to design these things.

Tstudent will have four columns (these are sometimes called fields). This lists all the major operations we can do to tables—(Drop is SQL-speak for deleting a table).

We want to create a table; so we proceed and are presented with a form to define our table (shown above).

SQLite is different from other SQL databases in that there are only a few different data classes (INTEGER, REAL, TEXT, NULL, BLOB); no size is mentioned as SQLite will use only as much space as is necessary. A BLOB (“Binary Large OBject”) is any collection of binary data stored as a single entity.

We will be prompted to confirm that we really want to create this table—and we are shown what the SQL statement looks like to achieve this:

So now we have a properly formatted table, but it still has to be filled with data.

The student_id needs some explanation; we have described it as the “Primary Key”, which we have specified as being Unique. Here the student_id key is used as a reference to individual student information in this and other tables. By specifying the “Unique” property, the database will reject any attempt to add a row with a duplicate student_id value.

Primary keys may consist of more than one column—for example we could insist that the combination of (surname, forename) is our primary key, but this wouldn't allow two students with the same combination of names to be enrolled. Each primary key causes SQLite to maintain an index.

It is usual, but not a requirement, that a primary key has the Unique keyword.

Although we have mentioned primary keys, these are really the only kind of key to consider. It should be mentioned that there is a rather useful idea of something called a “foreign key” and we will give an example of its utility later.

We can now enroll some students (one at a time). The college has decided that the student_id is to be printed on all documentation, and will consist of a letter followed by seven digits. This is an important constraint which should (and could be) checked by the database. However, for the sake of simplicity, we shall assume that it is handled correctly by the user program.

Select the Tstudent table and take the add option:-

After entering your data, the system will prompt you with the appropriate SQL statement for adding a row.

However, entering data one line at a time will lead to very poor database performance. Because we didn't specify how many SQL statements form a logical transaction unit, SQLite assumes the single INSERT statement is an entire transaction. The system needs to lock the database against update, check there isn't a pre-existing row with the same unique key, write the data row, and flush the data to the disk to ensure the integrity of the database. All this activity is probably going to require a couple of rotations of the disk drive. We gain “atomicity” (a row is never partially written) but suffer a large performance penalty in disk-wait time.

The solution is to carry out a number of database updates as one logical transaction. We simply do a little more work and spend less time waiting for disks to respond.

We mark the beginning and end of the transaction with SQL statements

BEGIN TRANSACTION; … … COMMIT;

We can enter any reasonable number of statements (thousands) that make sense—covering multiple tables and various operations. When we reach the COMMIT statement, the database can be considered to have been properly updated. Alternatively, if an error occurs part way through (or we issue the ROLLBACK command), the database is placed back into the state it was in before we started our transaction.

So let’s add a few more students (we know we want to execute some SQL, and “Data manipulation” seems the most appropriate).

We get a prompt of

INSERT INTO tableName [(column-list)] VALUES(value-list)

Bearing in mind our successful one-row attempt, it seems reasonable to try pasting the following piece of SQL into the “Execute SQL” box. It's quite long, but it's just a lump of repetitive code (shown above right). One of those names is a real person (shown above).

However, when we attempt to run this piece of SQL, we get:

This is unexpected—but it turns out that Firefox always arranges for invisible BEGIN and COMMIT instructions to be inserted into any SQL that is run. The error (correctly) says that you can't nest transactions.

No programming language would ever do this for you; in a real-life situation, you would want to intercept some errors, and either correct them or simply accept them as errors which are not serious enough to cause a transaction to fail.

In this case, we just remove the BEGIN and COMMIT statements and it runs perfectly. There is a very simple SQL command to display all rows and all columns of a table—so we'll run it to prove our table contains the “right stuff” (shown on the next page, top left).

So far, we have managed to create just one table. It is only when we have several tables that we can map the relationships between them. You will be pleased to learn that the Firefox add-on allows you to import an initial data load if you have appropriate data available as CSV, XML or SQL source; it is also possible to export a table to these file types too. We still need to write the SQL statements to create the tables, but we can import our base data from spreadsheet(s).

For Taddress we shall specify the columns as follows:

Note that we could have specified TWO columns for our primary key and the corresponding SQL command generated would be:-

CREATE TABLE “Taddress” (“student_id” CHAR NOT NULL UNIQUE , “line_no” INTEGER NOT NULL UNIQUE , “address” CHAR, PRIMARY KEY (“student_id”, “line_no”));

However, we choose to use a feature of SQLite, provided that we have only one primary key column and that it is of type INTEGER, we can ask for it to be auto-incremented on each write to the table. You may have noticed that a column (called rowid) is automatically inserted in each row of a table; by using autoincrement we get an ascending sequence of line numbers (possibly with gaps in between).

Since it would be meaningless to allow the entry of the address of someone who is not in the Tstudent table, we need to prevent this. This kind of problem is described as a “referential integrity issue,” and it could be prevented by relevant program coding. However, the problem is really intrinsic to our database design, and a database with dozens of tables would suffer many more referential integrity issues.

The solution lies in entering the data rules into the database, so that the database can enforce data integrity (note that although the same name of student_id was used in both the Tstudent and Taddress tables, this doesn't mean that there is any implicit connection between the two tables).

This is the situation alluded to earlier of a “foreign key”—here we would want any attempt to update Taddress to check that the student_id already exists in Tstudent. We can think of Tstudent as being the “parent” whilst the corresponding multiple address rows are “child” entries. Potential problems could also happen if we were to delete a row from the parent, but we can instruct the database to automatically delete all the child entries.

Unfortunately, the Firefox GUI does not support the generation of SQL for foreign keys. Furthermore, it actually has foreign key supported switched off (by default) this is easily fixed by telling Firefox to run a command to enable foreign keys! It is important to type the PRAGMA foreign_keys=ON; correctly, and press save. You need to do this only once for each database you create, as future connections to the database will automatically run this PRAGMA command.

The correct syntax for the foreign key specification can be rather complicated. With a little practice, this is straightforward; the following will fix the situation:

DROP TABLE “Taddress” ; delete the existing table and data…

CREATE TABLE “Taddress” (“student_id” CHAR NOT NULL ,“line_no” INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL ,“address” CHAR, FOREIGN KEY(“student_id”) REFERENCES Tstudent (“student_id”) ON DELETE CASCADE);

This will cause updates to check that the FOREIGN KEY(“student_id”) (in Taddress) REFERENCES (matches) in the table Tstudent (“student_id” column). The ON DELETE CASCADE phrase will cause any deletion of a student in the parent (Tstudent) to delete all relevant instances from Taddress.

We can summarise the major constraints we can place on a column: • NOT NULL • UNIQUE • PRIMARY KEY • FOREIGN KEY • CHECK e.g. CHECK(price > 0) • DEFAULT (the value to be used if none is supplied)

We can try and add an address line for a non-existent student:-

Note that only two input parameters are needed

This gives an error message—foreign key constraint failed.

An attempt to add a valid student will succeed.

We don't really care whether the line number of the address is 1, 2, 3, … or not—just that it is increasing. We have arranged for the database to do it and save some work in our program.

A simple example (in Python) of adding just a few address lines would be:

Note that the interface converts a Python “None” into an SQL “Null”, there is no error handling, and the address line sequence number is generated (correctly) by the database.

We now need to create a new, super simple, table which will be a “codes file”; it associates exam codes with their titles. Since it's so simple, we can just type it into the “Execute SQL” area without any prompting.

CREATE TABLE Texam_name (exam_id CHAR PRIMARY KEY NOT NULL UNIQUE, exam_title CHAR);

Because we have the data available, we shall load the initial data from a CSV file (created from a spreadsheet):

We now have a number of tables and we are in a position to use one of the most powerful features of SQL—when we combine two or more tables together. To do this we use the JOIN keyword. There are two main varieties of JOIN available – by far the most common type of JOIN is a matching columns JOIN (technically an INNER JOIN). You can guess the name of the other kind of JOIN!

When specifying our JOIN, a problem that sometimes occurs is that we have used the same name for a column in two tables – but perhaps it has different meanings – in which case we need to qualify the column name by table name. As an example:

SELECT surname, forename, address FROM Tstudent AS S

      JOIN Taddress AS A                                            	ON S.student_id = A.student_id;

(The multiple lines are purely for readability), we enter this into the “execute SQL” section:- Note the optional temporary renaming of the tables to single letter names, purely so the full qualification of field names is abbreviated. We don't have much data in our database, so it looks like that shown below.

It happens frequently that we make a habit of making up new names for the same thing, so in fact our student_id means the same thing in the two tables. As a shortcut we have the idea of a “NATURAL JOIN”—where any matching names are assumed to be required to be matched; so an equivalent to the previous JOIN is:

SELECT surname, forename, address FROM Tstudent

                        NATURAL JOIN Taddress;

Note that a NATURAL JOIN can give unexpected results if extra columns are added to the tables at a later date.

If we expect to want to use our address list periodically, then we should create a View (using our SELECT statement above). There is no overhead in doing so, since it is just an instruction on how to view existing tables. However, it is important to realize that for the same reason, only tables can be updated—views are just a reflection of the current state of the tables; the data in views cannot be updated.

We can also create joins which are a JOIN of a table with itself, to create Views of a single table—perhaps to hide a particularly complex piece of SQL. It's also possible to join more than two tables in one statement—but then things tend to get complicated!

If we have a very big database—many thousands of rows in several tables—performance will suffer. In this case, we should look at how we are using our database, especially with respect to JOINs, and specify the Index → Create index option. This is likely to speed up queries, at a small penalty for database updates (we have to maintain each additional index).

The last major feature that can be used is the TRIGGER statement – where it is desirable that some database event should 'trigger' a sequence of actions. This is really a rather advanced topic and I refer you elsewhere!

We still have one last table to construct—the one that links the students with their exams.

Texams student_id exam_id score

The exam score will be zero if they have not sat the exam yet. To access the score, we need the combination of student_id and exam_id to be our unique (primary key). This is a slightly tricky case, because we don't allow a student to take the same exam twice, but the data structure means that we need to accept duplicate student_id's and duplicate exam_id's, but reject any combinations of the two which are not unique. This is a situation where a composite primary key has the effect we want.

So Firefox produces:

CREATE TABLE “Texams” (“student_id” CHAR NOT NULL, “exam_id” CHAR NOT NULL, “score” INTEGER DEFAULT 0, PRIMARY KEY (“student_id”, “exam_id”))

We should really declare student_id to be a foreign key so that it is validated against Tstudent and similarly exam_id should be a foreign key so that it is validated against Texam_name. So we delete (with DROP Texams) and execute the following SQL; as you can see, it is effectively similar to our previous encounter with foreign keys.

CREATE TABLE “Texams” (“student_id” CHAR NOT NULL, “exam_id” CHAR NOT NULL, “score” INTEGER DEFAULT 0,

CHECK (score >= 0),

PRIMARY KEY (“student_id”, “exam_id”),

FOREIGN KEY(“student_id”) REFERENCES Tstudent (“student_id”),

FOREIGN KEY(“exam_id”) REFERENCES Texam_name (“exam_id”));

(Actually, this is not quite correct as SQLite allows character values to be stored in numeric columns—so the CHECK statement would not work properly; we could fix this with a more complicated CHECK statement—but it wouldn't add anything to the readability. So we will leave it alone).

We can enter all of the test cases using INSERT, and they should all behave properly.

With our last table we are in a position to do something typical. The college has a few rather arbitrary rules, which change from time to time, so they don't really need to be enforced by the database. For example: “No student can enroll for M829, without achieving a pass in M823 previously”. This is easy to do in SQL with a little bit of decision in a program.

As a slight variation, I've supplied a version that runs in a terminal screen; it could easily be converted to a BASH script. Many programs can have the database open, concurrently.

jeremy@HECTOR:~$ # An example where student K1001001 wants to enroll on M829

jeremy@HECTOR:~$ sqlite3 Exam.sqlite # Open database

SQLite version 3.7.9 2011-11-01 00:52:41

Enter “.help” for instructions

Enter SQL statements terminated with a “;”

sqlite> SELECT count(*) FROM Texams WHERE (“student_id” = “K1001001” AND

 ...> "exam_id" = "M823" AND "score" >= 40 );

1

sqlite> INSERT INTO Texams VALUES(“K1001001”, “M829”, 0);

sqlite> .exit

jeremy@HECTOR:~$

The SELECT count(*) … statement is an example of one of the many built-in “aggregate” functions – here the number of rows that meet the conditions in the WHERE conditional, in this case a value of '1' is returned. So we can add a row (with a score of zero) for M829.

I shall give a number of references for more advanced use; but even if you want only a single keyed file or two—perhaps for an address book—a database is a big advance over some complex coding solution.

References:

http://www.sqlite.org/docs.html - The full reference; not a tutorial.

http://zetcode.com/databases/sqlitetutorial/ - A very concise SQL tutorial.

http://zetcode.com/db/sqlitepythontutorial/ - Python.

http://zetcode.com/databases/sqlitephptutorial/ - PHP version.

http://zetcode.com/db/sqliteperltutorial/ - Perl.

http://zetcode.com/db/sqliteruby/ - Ruby.

http://zetcode.com/db/sqlitecsharp/ - C#.

http://zetcode.com/db/sqlitevb/ - Visual Basic.

http://en.wikipedia.org/wiki/List_of_relational_database_management_systems - general interest only.