Feature proposal: SQL Import
Synopsis
SQL Import will be an add-on to the existing Import tool that (see Import Re-design below) allows you to specify an SQL query to run against the database server that CiviCRM connects to. You will specify a different database schema to pull data from in the query, and this will generate a set of columns and rows to be used by the import tool just as though you'd uploaded a CSV file (which the tool currently uses). Eventually this will be extended to save import jobs (not just the field mappings) for manual and/or automatic re-running of the job. This, combined with CiviCRM's de-duping abilities, will allow for simple syncing of other systems' data with CiviCRM.
Use case
At Public Interest Network (U.S. PIRG, Environment America, others), we have ~100 different organizations with thousands of contacts each. Importing even one small-ish group's contacts (~5,500) through the web CSV tool is quite slow and error-prone. Browser timeouts, server timeouts, and database->CSV->import->database conversion errors are all an issue. By pulling the data directly from the database server, it can be faster and less error-prone, as well as requiring less admin personnel time to accomplish.
CiviCRM 2.2 Proposed Feature List
- Option to specify an SQL query instead of uploading a CSV file in the Import tool (will check to make sure it's a SELECT query).
- Once data is successfully pulled from query, the regular import tool takes over and works the same as if you'd uploaded a CSV.
Post-2.2 Feature List
- Ability to save entire SQL import jobs so they can be easily re-run later.
- Ability to schedule save SQL import jobs so they run on a recurring basis.
Import Re-design
After discussing this with dlobo and examining the code, it seems a re-design of the import system is required to make this work. I am planning to build a new import system that uses pluggable DataSource classes to pull data from external sources into a temporary database table and then runs the import from that. This creates some opportunities to speed things up as well, since you can use SQL commands in some cases to transform the whole table at once rather than operating on every row individually. I will attempt to take advantage of these opportunities as much as possible.
My plan is to implement a CSV DataSource (to duplicate existing functionality, but hopefully much faster) and an SQL DataSource (that works as described above). Interestingly this re-design will make the SQL DataSource one of the simplest DataSources.
DataSource classes will consist of a form snippet that gets called when a user selects that DataSource during an import. This form snippet will ask for whatever info is needed to grab the data from the external source. Once posted, the import system will callback to the DataSource class with a temporary table name and PEAR::DB connection. It will be up to the DataSource to put the to-be-imported data into that table (but it need not do any transformation beyond what is required to store it in MySQL--so you may have to format some dates correctly, for example).
This is a great idea.
Might be useful to document how this would work in practice for importing from different databases, some on different servers, some perhaps not residing in MySQL instances.
Currently I'm not planning to support separate database servers with this. However, the new pluggable DataSources should make something like this possible to write. In fact, it may just be an improvement / generalization of the SQL DataSource.