= !OpenSubtitles v2 draft specification =

== Introduction ==

Python code should follow [http://www.python.org/dev/peps/pep-0008/ PEP-8] coding practices; see also [http://django-python.com/python-tutorials/tips-for-readable-python-code/ another good introduction to the basics].

== Subtitles section ==

We are trying to avoid duplicate subtitles as much as possible, so in an ideal world there would be only one subtitle for many releases. We approach this by using [http://bitbucket.org/danger/pysublib/ SubLib]. The system will store only one subtitle for each version of a movie: [Matrix], [Matrix - Extended Cut], [Matrix - Directors Cut] and so on.

That is the ideal situation, and the real world is not ideal: there are many rips of each version. Ideally, the system would only need a set of rules describing how to change the original subtitles to fit a particular rip. For example:

 1. Take [Matrix] subtitles id 123456
 2. Change the frame rate from 25 FPS to 23.976 FPS
 3. Add 3 seconds to the beginning
 4. Cut the subtitles at 1 hour 25 minutes 45 seconds 78 milliseconds and make 2 files

With such rules we can represent any version and, hopefully, cover any need for a movie.

One big advantage of this system is wiki-style subtitle editing: you watch a movie, spot a bad translation or a typo, log in to the site and edit the subtitles online (or with a program that implements our API). The changes then apply to all versions.

So the system will hold one master subtitle for each version of a movie (original version, director's cut). That is the starting point; the database will also hold the rules for "changing" it to fit different movie rips. The master subtitle is stored in the database, and by analyzing the timestamps of other uploaded subtitles we derive the rules for re-timing it to other movie rips. That is the theory.

Wiki editing - users should be able to edit and translate subtitles online, with a versioning system.
All changes will be tracked, so the system will know how many changes each user has made.

Subtitles export - subtitles are stored in the system as metadata, so users can choose any subtitle format they want.

Real-time re-timing and cutting - !SubLib supports re-timing, cutting and "moving" subtitles, so this should also be available online and via the API.

Store the encoding together with each subtitle. Use [http://mirmodynamics.com/post/2008/12/17/Charset-detection-with-python charset detection] to determine it.

For language detection use !TextCat (Python [http://thomas.mangin.com/data/source/ngram.py implementation]), [http://code.google.com/p/langdet/ langdet] or the Google Translate [http://www.catonmat.net/blog/python-library-for-google-translate Python library].

== Movie section ==

Support more than one website as a source of movie data. Currently only imdb.com is implemented ([http://imdbpy.sourceforge.net/ Python wrapper]), which is not bad, but they do not provide any official API access to their database. That is why we need to add sites like [http://www.themoviedb.org/ TheMovieDB.org] ([http://github.com/dbr/themoviedb Python wrapper here]) and [http://thetvdb.com/ TheTVDB.com] ([http://pypi.python.org/pypi?%3Aaction=search&term=thetvdb&submit=search Python wrapper]). So support three sites, with imdb.com as the last resort when the other two fail (?).

Mark whether a movie is a TV series, and store its season and episode.

Movie hashing - there is some need for a stronger hash, and doing it properly requires research: the current implementation (CRC64) is weak and could lead to collisions in the future (so far there have been none). Ideally, the system should support more than one kind of hash. I think it is a bad idea to put information such as movie length, FPS or dimensions into the hash itself. The hash should depend only on the file, for example the first and last 128 KB (sha1) and the file size, hashed together (sha1).

Media information - the database needs to store this information, but the problem is that different programs may report it differently.
The most important value is FPS.

== User Section ==

Registration - as simple as possible: username, email, password, with optional login via social sites, openid and so on (rpxnow.com).

Groups and permissions - similar to the current version of opensubtitles. There are permissions, each group has one or more permissions, and a user can belong to one or more groups. Django has good libraries for this.

== Translator Groups ==

There should be good support for subtitle translators and their groups. How to do this properly needs more research.

 * rating system
 * group ratings
 * stats

== Upload section ==

Website upload: the user should upload the files first and only then fill in the required information. Why? Because we can help fill it in: we can analyze the subtitles to detect their language, so the detected language is preselected. Duplicates and very similar subtitles should be avoided, which is another reason to let the user simply select and upload the subtitles first and fill in the remaining information afterwards.

== Request section ==

Subtitles should be requested the same way as now. A request should show work in progress and whether subtitles for the movie already exist, and it should allow bidding (donating) for a translation, so that translators can also choose what to translate according to the money offered.

== Website Translation ==

The current system of website translation is OK. More comments are needed from danger on how it will be done. Just notes:

 * only some users will have access to translate the website
 * emails should be sent every Friday morning if there are untranslated items or items that need translating
 * the web translation module should mark items that need to be retranslated (because the original text changed)

== API Access ==

Only registered user agents will have API access, using their key. The API should be offered through different standards such as XML-RPC, REST, JSON... A good example of an API is [http://api.themoviedb.org/ TheMovieDB]. API [http://alexking.org/blog/2009/12/13/api-versioning-tip versioning] is a must.
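A minimal sketch of how key checking and versioning could fit together, assuming the version is carried in the URL path (for example `/api/v2/...`). Nothing here is decided by this draft: the registry contents, the path layout and the names `REGISTERED_AGENTS`, `SUPPORTED_VERSIONS`, `is_authorized` and `route` are hypothetical and only illustrate the idea.

```python
import hmac
import re

# Hypothetical registry: user agent name -> API key issued at registration.
REGISTERED_AGENTS = {
    "ExamplePlayer": "k3y-for-example-player",
}

# Hypothetical set of API versions the server still answers.
SUPPORTED_VERSIONS = {1, 2}

# Assumed path layout: /api/v<number>/<rest of the path>
_PATH_RE = re.compile(r"^/api/v(\d+)/(.+)$")


def is_authorized(user_agent, api_key, registry=REGISTERED_AGENTS):
    """Return True only if the user agent is registered and the key matches."""
    expected = registry.get(user_agent)
    if expected is None:
        return False
    # compare_digest compares in constant time, so timing does not leak the key.
    return hmac.compare_digest(expected.encode(), api_key.encode())


def route(path):
    """Split '/api/v2/subtitles/search' into (2, 'subtitles/search')."""
    match = _PATH_RE.match(path)
    if match is None:
        raise ValueError("not a versioned API path: %r" % path)
    version = int(match.group(1))
    if version not in SUPPORTED_VERSIONS:
        raise ValueError("unsupported API version: %d" % version)
    return version, match.group(2)
```

Keeping the version in the path means that old clients keep calling `/api/v1/...` unchanged while new behaviour is added under `/api/v2/...`.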
[wiki:API]

== Caching ==

== Software specification ==

 * [http://www.lighttpd.net/ Lighttpd] as HTTP server, running FastCGI
 * [http://www.postgresql.org/ PostgreSQL] as database server
 * [http://www.python.org/ Python] as programming language
 * [http://www.djangoproject.com/ Django] as framework
   * [http://github.com/dcramer/django-sphinx Django-Sphinx]
   * Web Services
     * [http://code.google.com/p/django-rest-interface/ REST]
     * [http://code.djangoproject.com/wiki/JSON-RPC JSON-RPC]
     * [http://code.djangoproject.com/wiki/XML-RPC XML-RPC]
   * [http://code.djangoproject.com/wiki/GoogieSpell Google Spell]
 * [http://www.mongodb.org/ MongoDB] for caching of subtitles
 * [http://memcached.org/ Memcache] for memory caching
 * [http://sphinxsearch.com/ Sphinx-search] for fulltext search

New discoveries:

 * everything in the application should be stored in a document-oriented database instead of a relational one, because it is much faster and much easier to extend.
 * [http://www.peterbe.com/plog/speed-test-between-django_mongokit-and-postgresql_psycopg2 mongo vs postgre]
 * [http://blog.boxedice.com/2010/02/28/notes-from-a-production-mongodb-deployment/ mongodb after year in production]
 * [http://ivoras.sharanet.org/blog/tree/2010-02-20.mongodb-and-durability.html interesting reading about mongodb]
 * [http://simonwillison.net/static/2010/redis-tutorial/ redis] is better than memcached for some special tasks - really worth studying
 * [http://lucene.apache.org/solr/#intro Solr] when using mongodb - this is better than sphinxsearch (maybe it is better altogether)

Django modules used:

 * [http://bitbucket.org/jespern/django-piston/wiki/Home django-piston] - creating APIs
 * [http://code.google.com/p/django-rosetta/ django-rosetta] - translation
 * [http://github.com/uswaretech/Django-Socialauth django-socialauth] - authentication via social networks
 * [http://djangopackages.com/packages/p/django-pagination/ django-pagination] - pagination of results (can't we use Django core for this?)
 * [http://south.aeracode.org/ south] - data migration (useless when using MongoDB?)
 * [http://github.com/dcramer/django-sphinx django-sphinx] - Sphinx library, but we should use Solr instead
 * [http://github.com/robhudson/django-debug-toolbar django-debug-toolbar] - for debugging
 * [http://bitbucket.org/diefenbach/django-permissions django-permissions] - object permissions

admin, auth

 * http://nosql.mypopescu.com/post/904840384/django-and-nosql-databases-revisited
 * http://www.allbuttonspressed.com/blog/django/2010/08/Final-official-GSoC-Django-NoSQL-status-update
 * http://botland.oebfare.com/logger/django/2010/8/31/2/

== Other notes ==

 * Ratings should just be upvote and downvote.
 * Multiple-subtitle !TryUpload support, with a response for every request sent.
 * Multiple movie subtitle upload support (just allowing upload of more than one CD array)
 * Official support for subtitle upload without a matched movie
 * Simple multiple deletion of subtitles
 * Community-picked "best version" of a subtitle
 * maximum two subtitles per language per release
 * preview of subtitles will be AJAX style, fetched from cached subtitles

== Database Design ==

Table names and column names are not final and will change.

=== Version 1 ===

[http://www.opensubtitles.org/addons/sql/?keyword=version1&toolbar=hidden Version 1]

Tables not yet added for:

 * website translation
 * user permissions/groups

Notes:

 * subs_movie_data -> stores the raw Python-serialized info from the other websites' API responses

== Discussion ==

=== Where to store subtitle files? ===

It seems best to store the METADATA of subtitle files in the database; how else could users translate subtitles online if they are not in the database? For caching subtitles (so that they do not have to be REGENERATED every time) we can use MongoDB.

=== Is using an API key per user agent a good idea? ===

If somebody wants to hack it, they will just look up the API key in another app, and that's it. So what is the difference between enabling user agents and using API keys?

== Study ==
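As a starting point for study, the example re-timing rules from the Subtitles section can be sketched in plain Python. This is only an illustration under stated assumptions: the `Cue` class and the functions `change_fps`, `shift`, `split_at` and `apply_rules` are hypothetical names and not the !SubLib API, timestamps are assumed to be integer milliseconds, and the target frame rate is assumed to be the common 23.976 FPS.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Cue:
    """One subtitle line; times are milliseconds from the start of the movie."""
    start_ms: int
    end_ms: int
    text: str


def change_fps(cues: List[Cue], src_fps: float, dst_fps: float) -> List[Cue]:
    """Rescale timestamps made for src_fps video so they fit dst_fps video."""
    # The same frame is shown at time t * src_fps / dst_fps in the new rip.
    factor = src_fps / dst_fps
    return [Cue(round(c.start_ms * factor), round(c.end_ms * factor), c.text)
            for c in cues]


def shift(cues: List[Cue], offset_ms: int) -> List[Cue]:
    """Move every cue later (positive offset) or earlier (negative offset)."""
    return [Cue(c.start_ms + offset_ms, c.end_ms + offset_ms, c.text)
            for c in cues]


def split_at(cues: List[Cue], cut_ms: int) -> Tuple[List[Cue], List[Cue]]:
    """Cut into two files; the clock of the second file restarts at the cut."""
    first = [c for c in cues if c.start_ms < cut_ms]
    second = [Cue(c.start_ms - cut_ms, c.end_ms - cut_ms, c.text)
              for c in cues if c.start_ms >= cut_ms]
    return first, second


def apply_rules(master: List[Cue]) -> Tuple[List[Cue], List[Cue]]:
    """Apply the example rules to a master subtitle (rule 1 is the input)."""
    cues = change_fps(master, 25.0, 23.976)          # rule 2: 25 -> 23.976 FPS
    cues = shift(cues, 3000)                         # rule 3: add 3 seconds
    cut_ms = ((1 * 60 + 25) * 60 + 45) * 1000 + 78   # rule 4: 1 h 25 min 45 s 78 ms
    return split_at(cues, cut_ms)
```

Because each rule is a pure function over the master cues, a typo fixed in the master subtitle shows up in every rip the next time its rules are applied.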