Restructure/Redesign

Announcements about the database, website, and plugins.
szsori
Site Admin
Posts: 1972
Joined: Fri Nov 03, 2006 5:23 pm

Restructure/Redesign

Postby szsori » Mon Feb 02, 2009 1:15 am

I've been thinking more and more about a restructure of the site. This would include reworking the database to be more robust, fast, and have better data integrity. We've been talking about this a bit behind the scenes and I think it should be opened to public discussion a bit. Most of these things will not affect everyday users, but would rather affect the developers using the API. However, I'm looking for all feedback on the subject. I'll start brainstorming here...

Database Server:
mySQL gives us flexibility and saves a lot of money compared to MS SQL. Its primary drawbacks are no true foreign key support, no native XML support, and limited fulltext search capabilities.

MS SQL does have awesome fulltext capabilities for search as well as native XML support, but would be an expensive option and limit our choices should we ever need to find an alternate host.

PostgreSQL has awesome fulltext capabilities but no native XML support. The main drawback is that I have no experience with it, but I'm guessing it would be no harder to learn than mySQL was.

Schema Design:
All columns would get unique names related to the table name. Right now we have an "id" field on almost all of our tables. In the new structure this would be series_pk, episode_pk, actor_pk, etc. This would make queries easier and make things far easier when we export data. It's also the "proper" method. All foreign keys would use the _fk suffix, like episode_series_fk.

All data would get stored procedures to list/detail/add/edit/remove data. This means that we would no longer have SQL code actually sitting out in the PHP code. Since the SQL would exist in the stored procedures it would run quite a bit faster as well and would lock down any sort of injection attack.

Data relationships would be accurately set, including proper cascading deletes. This means no more orphan records.

Data Revisions:
This is a big one in a number of ways, and I'd have to figure out the best way of doing it. I would basically work in the ability to roll back changes to the data, similar to how wiki does it. I think the most efficient way would be to have two versions of each table, one the "live" table and one the "revisions" table. When a record is modified the current record would be copied to "revisions" via a SP before the main record is updated. This would make for a much bigger database, but I think it allows some features that I was hesitant to allow before.

I'd have to look into how deletes would work with this. Perhaps the deleted record would just be removed from the live table while the revisions would be left in the revision table. We'd need some interface for viewing/recovering these records though... would require more thought.

API:
The API would be completely dynamic and would just allow direct access to views via stored procedures, including filtering capabilities. We'd have to address performance issues with this, but I believe if done right it would allow developers to access exactly the information they need without having to pull all of the other data.

One of the biggest changes would be allowing full post-back capability to the database via the API. This means that if an episode is missing an overview, the user's software could prompt the user to enter it after they're done watching the episode. Since we would have revisions for each record, I feel safe doing this. It would also allow full access to rating information for banners, episodes, etc and would allow users to modify their own information from their HTPC software (possibly even create an account there automatically). The end result is far more accurate and complete data and ratings.

Seasons:
Right now we only support a few types of "seasons". These include seasons based on the airdate (which we call "default"), dvd seasons, and absolute order. I'd like to allow dynamic season types so we can handle multiple releases on DVD, multiple airdate orders (if necessary), and true absolute ordering. If a series doesn't need a DVD order, it wouldn't exist until the DVD comes out... in other words, we'd do things right.

Comments and/or Live Chat:
For shows currently airing, we could run a basic IRC server that would allow clients to interact while watching the show. We could also possibly run it on another IRC network like Efnet. Channels for these episodes would be named based on the episode_pk so they might be something like #thetvdb_episode12345. By keeping it simple and using a standardized technology like IRC, developers could use IRC clients in their code so Media Portal users could chat with XBMC users (and others) during the series. The site would also include a Java IRC client to allow chat.

Comments would be included for each series and episode. To keep things simple, we'd do a non-thread discussion similar to Fark. This would be accessible via the API so a person can read/respond to comments directly from their HTPC software no matter when they watch the show.

Fusebox Site Design:
Right now the site sort of uses a fusebox, but I did things wrong the first time. Code for each set of SP calls and related functions will be put into a components directory. So we'd have components/series.inc, components/season.inc, etc. We'd then have view pages specifically for each section on the site, so pages/series.php, pages/season.php, etc. Finally, the index.php would include the security.php, header.php, footer.php, and proper page via a case statement and otherwise do very little.

Standardized Interface:
The site would be designed to use no pop windows, relying on separate screens and/or AJAX to display necessary data. When changes are made, the fusebox would accept a message and redirect url and pass that info to pages/message.php. This would simply display the message and redirect to the target url after a certain number of seconds. This is similar to how phpBB was changed to work, and it gets around various reload/refresh and variable issues.

The CSS would be greatly simplified and the pages would be designed to be as simple as possible while displaying the required information.

Copyrights:
To help protect the site and our users, we'd tag all overview dat with "data source" fields. I believe tvrage does this right now, where if you copy/paste from tv.com you actually select tv.com as your source. Data written by users (coming in via the site or API) would be tagged as custom. Custom overviews will not be able to be overwritten by other site sources.

All artwork will include fields for original source and credits, so artists can give proper credit whenever possible.

We'd label all custom site content with the proper copyright while noting that tv.com, tvrage.com, etc sources are copyright by those sites. This should allow commercial software to better use our database.

Premium Features:
I'd like to see some sort of premium features developed that would go directly back to the server providers, since they provide a great server and TONS of bandwidth. The site would be designed to include ad space that's unobtrusive but more profitable than the current Google ads (which generate very little money).

One possibility would be having tiered access to the data. I fully believe that the data should be as open as possible, but perhaps something asking software developers to place small unobtrusive tvdb-provided ad banners in their software for free tvdb users while hiding them for subscription users ($10/year?). I'm trying to find a way to get these guys revenue without being annoying for our end users or developers, so I'm open to any ideas people have.

More Community Features:
In addition to allowing people to comment and chat about stuff via the site, I'd like to allow people to become "fans" of series, actors, etc. Down the line this could be used to do more advanced recommendations for things people might like. It'd also allow us to make cool profile pages (public viewing toggleable) that people can link to with tvdb signature banners.

User View Tracking:
This is pretty simple, but I'd like to allow clients to send viewing data for end users. This would, of course, be toggleable by the user themselves (we'd make it a site requirement and/or build it into their profile settings). It would allow cool things like showing what percentage of each show they've watched, what episodes they are missing for a series, what their total rank is for tv viewing that day/week/month/year/overall, etc. It'd be something fun that I think would be relatively easy to do and would get people to visit the site more. :)

Other:
Other requested features would be incorporated right away as well, like batch imports, full database exports (stripping user data), etc.

That's where I am so far with my brainstorming. Keep in mind that if I start this, it'll take quite a while to complete. I would probably allow Coco to maintain the current site while I work on the replacement in the background, and then slowly allow people access to the new site for beta testing. When we go live with the new site, the current API would either be rewritten to be dynamic using mod_rewrite (meaning it'd appear to be the same, but would function differently behind the scenes) or we'd completely shut down the current API at that point. I'd hate to put the developers through that again, though. ;)

Give me as much feedback as possible please.

Coco
Site Admin
Posts: 2476
Joined: Tue Mar 13, 2007 10:16 pm
Location: Canada

Re: Restructure/Redesign

Postby Coco » Mon Feb 02, 2009 5:30 pm

Ok, I'm going to try and go through all of these.

For the database server I personally I would prefer to just stick with mySQL. The database isn't really that big compared to a lot of other mySQL databases so I think it can handle it just fine and it makes it easier to move the site which is something that will happen if we continue to grow. It does have foreign key support depending upon how we set it up. I don't personally think full text search is a major issue as even MS SQL's isn't a great way of doing a search, it just doesn't cover everything people expect from a search these days. If we really want to make the search work well we'd have to go with something more complex then using fulltext search, like figuring out sphinx or another program like it. My second choice would be PostgreSQL, I've only ever used it with pre built apps before though so I'd have to check it out a little more to be comfortable with it.

Edit: I just looked up the databases a bit and I don't think Postgresql is a good way to go for us. Speeds very important for this site and postgresql is noted for being slower then mySQL, running mySQL with InnoDB tables would give us the FK support we need. If we aren't going to with mySQL I'd have to vote for MS SQL even though I don't really want to go with windows servers.

As for the scheme design I would say what you've suggested is fine, but it's going to take a lot of work. If you've got the time to do such a massive change then it sounds like a good idea. If the whole site is being rewritten anyways it's definitely the time to do this, as there's some things in the current one I avoid fixing because I'm not sure how many places I'd have to patch the code if I did.

The data revisions seems like a good idea but we'll have to be a little more details in how records are saved I think. Right now a record is saved and every single field is updated with the information regardless of if it's changed. We'd have to change that so only changed fields are being saved otherwise trying to go back through the revisions would become such a pain because there would be a complete revision for every single time someone makes a minor edit to one field.

The changes to the api sound like a good idea but, how would we mirror that? As far as I can see this would mean mirrors have to have a DB running. Either that or we no longer mirror the api and only mirror images? Also I'm a fan of the post-back being built into the api but I'd list it as a lower priority as we wouldn't need it to run the site so I'd try and get other parts done first.

I'm not so sure about your suggestions for the seasons. Having multiple aired orders or DVD orders seems a little extreme to me. It would also be very hard to moderate as it would be hard to confirm other air dates for most countries. Limiting people to three possible orders really covers it right now IMO, one of them should be a preferred viewing order regardless of which country or language you speak. Adding more would add confusion and could mess up some of the records. The only time I'd really see a need for more than 3 orders is when episodes get split up and merged with other episodes. It's rare but sometimes the episode count itself changes which is hard for us to accommodated; I'm not really sure how to solve that problem though.

Comments and live chat both sound fine but I wouldn't worry about them right now. Most of our users are via their media software, not web access so I don't know how useful it would be. I'd open it up to comments first and allow post back from the media programs. If that catches on then add live chat later, but I'd list both as really low priorities.

As for the fusebox/standardized interfaces, nothing to really add, sounds fine.

I don't know copyright law at all, if this helps then sure we could do it but I don't think referencing tv.com or tvrage.com helps as they don't allow their information to be copied even if credited (could be wrong). If we do include the field remember that it would have to be filled in per language, so one description might have 20+ sources if it's in multiple languages.

I don't really have much to say about the rest of it. They seem fine and I'll most likely comment more on them later.
Please don't PM me. Post in the forums instead unless there is a very good reason for a PM or I won't respond.

Tara
Posts: 19
Joined: Sat Jan 17, 2009 9:59 am

Re: Restructure/Redesign

Postby Tara » Mon Feb 02, 2009 7:48 pm

I've used PostgreSQL, and found to to be every bit as easy to work with as MySQL, once you get the hang of a few subtle differences.

It's good enough for NASA, it's good enough for me. 8-)

szsori
Site Admin
Posts: 1972
Joined: Fri Nov 03, 2006 5:23 pm

Re: Restructure/Redesign

Postby szsori » Mon Feb 02, 2009 10:44 pm

I had another friend recommend PostgreSQL for the project based on his experience. He said it's about as easy to learn as mySQL. Apparently it's slightly more strict, but we're not doing anything outrageous anyway. Foreign key support is a must-have for the design. Right now we rely on code to cascade records if there's a delete. That's not how it should be... modern databases can maintain data integrity so why not allow it to handle that for us? Also, foreign keys will greatly speed up the massive joins we have to do for any useful information. mySQL is good and all, but it's painfully slow to do a simple join from 3 tables because we don't have the indexes set up right.

As for fulltext, the obvious goal is to eliminate the need for outside software like sphinx. Fulltext in PostgreSQL isn't that far off what sphinx can do, and it would easily handle the various stopwords (including translated stopwords) and punctuation. Most hosts that provide mysql also provide PostgreSQL, so that's not a limiting factor.

The speed is definitely a factor, so I'd have to look into that. I know that turning fsync off in PostgreSQL and doing things with stored procedures greatly increases the speed in PostgeSQL, but I don't know much beyond that. I'll talk to some of my db friends and hope that someone else posts here. :)

I think it would actually be easiest to save the entire record for the revisions stuff, but obviously that wastes storage. The total number of records would be the same no matter which fields we're saving. Saving the entire record means that doing revisions is as simple as having a revision table for each normal table, doing an INSERT INTO...SELECT statement in each UPDATE stored proc, and making a simple interface that lists all of the records in the revision table while only displaying the most recent one (or using AJAX to load past ones as desired). I dunno... this is something we should think about more because it's really important.

The API is currently only mirroring the images and zipfiles onto another server. We'd still have to generate the zipfiles to use those, which means we'd be able to mirror those as well. My goal would be to make it more efficient, so I'd want to think this over some more.

I'll have to think through the seasons more, but I think it could handle the situation you're talking about. The airdate order would obviously stay the same. When someone comes up with another order we'd just create a new one called "DVD" or whatever it is. Then we'd add existing episodes to the appropriate seasons and/or insert new episodes that didn't exist in other orders. It'd give full flexibility. I am aware that this might make things too flexible, so I think it's worth discussing more.

I agree on comments and live chat. Comments would be relatively easy to do and would add a lot to the site. When combined with post-back on the API, it'd be pretty nice.

Good point on the sources for different languages. I'll have to think about that. Not having sources listed for our content is a big issue right now because there's a number of companies that want to use the API for commercial apps and can't because of the copyright issues. I want to start resolving that and protecting the site. That way if tv.com says we need to remove all their summaries we can do so and replace them with another source if it exists in the revisions.

Additional note for everyone... I also want to add Hulu embedding (or similar) into the site. This would allow people to view the episodes right on our website. I have to look into Hulu's TOS to make sure that would be allowed, but I think it would be fine. Note that Hulu is US-only, so I'm open to other options that are more global.

Coco
Site Admin
Posts: 2476
Joined: Tue Mar 13, 2007 10:16 pm
Location: Canada

Re: Restructure/Redesign

Postby Coco » Tue Feb 03, 2009 2:22 pm

I looked into postgresql a little more. Most of the comparisons I read about it vs mySQL indicate that it has issues with large text fields. Although I can't seem to find the exact limit, I think it's something like 8000 characters but I'm really not sure. Could be an issue. Although the part that scares me a little is the documentation claims counts are slow because it doesn't use the index's when doing counts. If that's a huge issue I suppose we don't really need them as we only do counts on reports (edit: Ratings could be a real issue, if counts are really slow), but it would limit us on our reporting if it's a big issue.

As for actually learning to use postgresql I'm not to worried, basic SQL statements are going to be pretty much the same, in fact I think any sql statement that works in mySQL will work in postgresql. I don't believe it's true the other way around though.

Also for the data revisions, you might be right, we'll just have to save the whole row each time. Although we should be picky enough to tell that someone hasn't actually changed anything and just hit save. In those cases we needn't save a revision. It's also going to get a little tricky with all the translatiosn because when someone saves a series or episode we'll have to go through and grab all the forieng language information as well, so a single data revision could include 20+ rows.

Or maybe the new site could separate the languages a little more, so you can only edit one language at a time and you must switch to a diffrent language to actually edit in that language. If done right it might clear up some confusion. Not everyone understands the current method and I commonly see people remaking a series in another language and then copying information from the english one into it.
Please don't PM me. Post in the forums instead unless there is a very good reason for a PM or I won't respond.

szsori
Site Admin
Posts: 1972
Joined: Fri Nov 03, 2006 5:23 pm

Re: Restructure/Redesign

Postby szsori » Tue Feb 03, 2009 6:09 pm

I agree completely on the language portion. I think we'll set it up to only display the user's default language throughout the site, but then display English when their default isn't available for that field. Then we'd also include a message that says "This is in English because a Polish (or whatever) translation isn't available. Please click here to translate this record.". I think if done right it would streamline things a bit and make it a little more clear.

I'm not sure what to do about the database. I've got some database friends I can talk to, but I doubt they'll have any in-depth information. I'll spend a while researching later on today. mySQL may end up working fine if we can set up proper indexes and possibly do cascading deletes via stored procedures or triggers. I also still have to figure out a good method for recording deletes of various records, both for the API and the revision part of the site.

I also think we'll need emigrating to chip in since he's a valuable resource on the database stuff.

andre_pl
Posts: 13
Joined: Wed Feb 04, 2009 11:25 pm

Re: Restructure/Redesign

Postby andre_pl » Wed Feb 04, 2009 11:32 pm

Coco wrote:I looked into postgresql a little more. Most of the comparisons I read about it vs mySQL indicate that it has issues with large text fields. Although I can't seem to find the exact limit, I think it's something like 8000 characters but I'm really not sure. Could be an issue.


Postgres has no such limit, I believe the 8000 byte thing is only used to determine if the data is stored physically in the row, or outside of it on disk. the difference is transparent to the client. Postgres is a fantastic decision for a database, and any claims of it being slower than mysql are generally based on their performance many years ago, postgres is blazing fast, especially for large complex joins, and that is where it really shines compared to MySQL, I recently switched for a larger project which required PostGIS for geospatial queries and now i would never dream of going back to mysql for anything.

for the record, any potential speed gain w/ mysql is lost when you switch to myisam tables w/ proper foreign keys.
mysql is fast because it doesn't do anything right, when it tries to do things right, its generally slower than postges, which just always does the right thing.


BTW, Hi, I'm new. Developing an app using this awesome API.

szsori
Site Admin
Posts: 1972
Joined: Fri Nov 03, 2006 5:23 pm

Re: Restructure/Redesign

Postby szsori » Thu Feb 05, 2009 3:41 am

Important to note regarding the text field is that we have only 2 series with overviews longer than 8000 and only 20 episodes that long. We should really think about capping those at a much lower value at some point as well, since nobody needs a novel about these things. Something like 1000 characters might be plenty.

Andre, I find it interesting that you're using Postgres for geospatial queries since that's exactly what my friend is using it for. He's the one that initially suggested Postgres for me. :)

Thank you so much for your input on Postgres. I think it's something I'll look into a little more, but I feel more comfortable about switching if necessary after your post and after talking to another friend today. I'm still going to look into it some more, especially some of the different tools available for each, but your info was quite handy.

andre_pl
Posts: 13
Joined: Wed Feb 04, 2009 11:25 pm

Re: Restructure/Redesign

Postby andre_pl » Thu Feb 05, 2009 11:27 am

Yeah for GIS Stuff the only viable options are postGIS and Oracle, mysql has some extremely limited geospatial capabilities but they're not even worth mentioning. (Pretty sure it can't even do a proper 'within x radius' search. I plan on making extensive use of thetvdb api an upcoming opensource app, so if you guys want any help w/ the redesign/reimplementation i'd be glad to help. I'll be coding a python API over the next couple days too, and already ran into something I'm unclear on. so check the developer forum for me will ya :)

Cheers,
Andre

szsori
Site Admin
Posts: 1972
Joined: Fri Nov 03, 2006 5:23 pm

Re: Restructure/Redesign

Postby szsori » Thu Feb 05, 2009 5:24 pm

Ahh cool. I'll be posting my schema design soon and would love some more experienced DBA's to look it over. After that I think it's just a matter of making a code factory to generate the SP's and functions, and then actually piecing them together on the site.