The Perfect Exchange - The Indexer

Now that I’ve officially dumped the Ruby MTP wrappers in favor of PyMTP, it’s time to do some deeper design work on the indexer.

Goals

The indexer needs to:

Move photos/videos from devices to the storage drive.
Keep a record of the media it has moved, including metadata such as relevant dates and paths to artifacts.
Not duplicate media.
Be able to recover in the face of connectivity issues.

Process

There are three high-level phases that compose the indexer.

Device Detection
Media Transfer
Indexing

Device Detection

How do I know that a device was connected and that the indexer should start running?

Rather than making this element event-driven, it seems simpler to just have a cron job that runs every minute or so. If the indexer is already running, the job won't start a second instance. This would spin all night, though, if I plugged in my phone and left it. Perhaps the job should only run once a night?
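One way to get the "don't start it again" behavior is a lock file that the cron-launched script creates on startup. This is only a sketch: the lock path and the function names (acquire_lock, release_lock, run_once) are my own placeholders, not anything the indexer has yet.

```python
import os
import tempfile

# Hypothetical lock location; any path writable by the cron user would do.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "indexer.lock")


def acquire_lock(path=LOCK_PATH):
    """Return True if we created the lock file, False if it already exists."""
    try:
        # O_EXCL makes creation atomic: it fails if the file is already there.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as lock_file:
        lock_file.write(str(os.getpid()))
    return True


def release_lock(path=LOCK_PATH):
    """Remove the lock file so the next cron run can start."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


def run_once(work, path=LOCK_PATH):
    """Run work() unless another instance holds the lock; report whether it ran."""
    if not acquire_lock(path):
        return False
    try:
        work()
    finally:
        release_lock(path)
    return True
```

A crash would leave a stale lock file behind, so a real version would probably need to check whether the PID recorded in the file is still alive.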

There are a few different device types that will need to be supported initially:

Android phones.
USB drives.

Media Transfer

The indexer will download all media files to a staging area on the external disk. After each file is downloaded, it will be deleted from the device. This way, if the device is suddenly disconnected, at most one file will have to be re-downloaded.
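The copy-then-delete loop could look roughly like this for a source that is mounted as a regular filesystem (the USB drive case). For the phone, the copy and delete would go through PyMTP instead, which I'm not showing here. The function names and the ".part" suffix are assumptions for illustration.

```python
import os
import shutil


def transfer_file(source_path, staging_dir):
    """Copy one file into staging, deleting the source only after the copy checks out."""
    os.makedirs(staging_dir, exist_ok=True)
    final_path = os.path.join(staging_dir, os.path.basename(source_path))
    partial_path = final_path + ".part"

    # Copy under a temporary name so an interrupted copy is never mistaken
    # for a complete file on the next run.
    shutil.copy2(source_path, partial_path)
    if os.path.getsize(partial_path) != os.path.getsize(source_path):
        os.remove(partial_path)
        raise OSError("incomplete copy of " + source_path)

    os.replace(partial_path, final_path)
    os.remove(source_path)  # safe now: the staged copy is complete
    return final_path


def transfer_all(source_paths, staging_dir):
    """Transfer files one at a time; stop at the first failure (e.g. a disconnect)."""
    transferred = []
    for source_path in source_paths:
        try:
            transferred.append(transfer_file(source_path, staging_dir))
        except OSError:
            break  # device is probably gone; remaining files stay on the device
    return transferred
```

Because the source is deleted only after the staged copy is verified, a disconnect mid-copy costs one file at most. Note that this sketch stages files by base name, so two source files with the same name in different directories would collide.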

For Android devices, any media in the /sdcard/DCIM directory will be transferred. For USB drives, any media in the /to-index directory will be transferred.

There may be both pictures and videos, and I’ll want to support both.
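To pick out pictures and videos from a source directory, something like the following might work. The extension lists are guesses on my part, and the walk assumes the directory is mounted as a normal filesystem (so it fits the USB drive's /to-index more directly than the phone).

```python
import os

# Assumed extension lists; extend them to match what the devices actually produce.
PICTURE_EXTENSIONS = {".jpg", ".jpeg", ".png"}
VIDEO_EXTENSIONS = {".mp4", ".mov", ".3gp"}


def media_kind(path):
    """Return "picture", "video", or None based on the file extension."""
    extension = os.path.splitext(path)[1].lower()
    if extension in PICTURE_EXTENSIONS:
        return "picture"
    if extension in VIDEO_EXTENSIONS:
        return "video"
    return None


def find_media(root):
    """Walk root and return sorted (path, kind) pairs for recognized media files."""
    found = []
    for directory, _subdirs, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(directory, filename)
            kind = media_kind(path)
            if kind is not None:
                found.append((path, kind))
    return sorted(found)
```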

Indexing

Once the media files have been downloaded, the actual indexing will begin. For every media file in the staging area, the indexer will:

Generate an MD5 hash based on the file contents.
Delete the file if it has already been indexed (that is, if its hash matches an existing entry).
Generate a thumbnail for the image or video.
Move the original and thumbnail from the staging area to its appropriate location on disk.
Store the metadata in the Firebase index. Metadata will include:
- Path to the original.
- Path to thumbnail.
- MD5 Hash.
- Date Taken.
- Date Indexed.
- Source Device ID of some kind.
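The hashing and duplicate-check steps above can be sketched as follows. Thumbnail generation, the move out of staging, the date-taken lookup, and the Firebase write are left out; known_hashes stands in for whatever lookup the real index ends up providing, and the function names are placeholders.

```python
import hashlib
import os
import time


def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash the file contents in chunks so large videos don't fill memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_staged_file(path, known_hashes, source_device_id, now=None):
    """Return a partial metadata record, or None if the file was a duplicate.

    known_hashes is a set of hashes already in the index (an assumption;
    the real check would query the index itself).
    """
    file_hash = md5_of_file(path)
    if file_hash in known_hashes:
        os.remove(path)  # duplicate: drop the staged copy
        return None
    known_hashes.add(file_hash)
    timestamp = now if now is not None else time.time()
    return {
        "hash": file_hash,
        "dateIndexed": str(int(timestamp)),
        "sourceDeviceId": source_device_id,
    }
```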

It may make sense to split the process that is transferring media and the process that is indexing. We’ll see.

The final resting place of media will be in these locations:

[project root]/media/thumbnails/[year]/[month]/[day]/[hash].jpg
[project root]/media/pictures/[year]/[month]/[day]/[hash].jpg
[project root]/media/videos/[year]/[month]/[day]/[hash].[extension]
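Building those paths from a timestamp and a hash might look like this. Zero-padded month and day, UTC dates, and the default extension are all assumptions on my part; the layout above doesn't pin them down.

```python
import os
from datetime import datetime, timezone

# Maps a media kind to its directory name under media/.
SUBDIRECTORIES = {
    "picture": "pictures",
    "video": "videos",
    "thumbnail": "thumbnails",
}


def destination_path(project_root, kind, date_taken, file_hash, extension=".jpg"):
    """Return [project root]/media/[kind]/[year]/[month]/[day]/[hash][extension]."""
    # date_taken is a Unix timestamp (int or numeric string), read as UTC here.
    taken = datetime.fromtimestamp(int(date_taken), tz=timezone.utc)
    return os.path.join(
        project_root,
        "media",
        SUBDIRECTORIES[kind],
        "%04d" % taken.year,
        "%02d" % taken.month,
        "%02d" % taken.day,
        file_hash + extension,
    )
```

Using the date taken rather than the date indexed for the directory is also a choice the layout leaves open; the sketch just takes whichever timestamp it is given.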

Index Format

Initial approach:

{
  "photos": {
    "[unique id]": {
      "pathToArtifact": "media/pictures/[year]/[month]/[day]/[hash].jpg",
      "pathToThumbnail": "media/thumbnails/[year]/[month]/[day]/[hash].jpg",
      "dateTaken": "1438078510",
      "dateIndexed": "1438088526",
      "sourceDeviceId": "4l3kjlsdkj",
      "hash": "[hash]"
    }
  },
  "videos": {
    [Same as photos]
  }
}

The [unique id] will be generated by Firebase’s push() function, so photos will be ordered chronologically by the time they were indexed.
Firebase can be queried using the orderByChild() function to retrieve photos ordered by either of the date fields. I should define indexes on those fields (via Firebase's .indexOn rule) to improve query performance.
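From Python, the same kind of ordered query can go through Firebase's REST API, where orderBy plays the role of orderByChild(). The sketch below only builds the request URL; the database URL in the usage line is made up, and photos_query_url is a hypothetical helper.

```python
import json
from urllib.parse import urlencode


def photos_query_url(database_url, order_by="dateTaken", limit_to_last=None):
    """Build a REST URL that asks Firebase for photos ordered by a child key."""
    # The REST API expects the orderBy value as a quoted JSON string.
    params = {"orderBy": json.dumps(order_by)}
    if limit_to_last is not None:
        params["limitToLast"] = str(limit_to_last)
    return database_url.rstrip("/") + "/photos.json?" + urlencode(params)


# Hypothetical database URL, for illustration only.
url = photos_query_url("https://example-db.firebaseio.com/", "dateIndexed", 10)
```

Since the dates in the index format are stored as strings, Firebase would compare them as strings; that happens to match numeric order as long as every timestamp has the same number of digits.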