got net?

Kevin Hazzard's Brain Spigot

About the author

Welcome to Kevin Hazzard's blog.

Disclaimer

The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.

© Copyright 2010

English Words Database from 11 Sources

I am working on a project where I needed a list of English words in a Microsoft SQL Server database. I found some public domain lists of English words at:

ftp://ftp.ox.ac.uk/pub/wordlists/dictionaries

There are 11 interesting word lists here:

  • Unabridged
  • CRL
  • Roget
  • Unix
  • Antworth
  • Knuth
  • KnuthBritish
  • Englex
  • Shakespeare
  • Pocket
  • UU.net

Most of these lists haven't been updated since the mid-1990s, so if you find a more up-to-date (free) source of English words, please let me know. I loaded all the data into a table that has these attributes:

  • [WordGuid] [uniqueidentifier] NOT NULL
  • [WordText] [nvarchar](30) NOT NULL
  • [WordLength] [tinyint] NOT NULL
  • [SoundexGroup] [nchar](1) NOT NULL
  • [SoundexValue] [smallint] NOT NULL
  • [GroupId] [smallint] NULL
  • [IsPalindrome] [bit] NOT NULL
  • [InUnabr] [bit] NOT NULL
  • [InAntworth] [bit] NOT NULL
  • [InCRL] [bit] NOT NULL
  • [InRoget] [bit] NOT NULL
  • [InUnix] [bit] NOT NULL
  • [InKnuthBritish] [bit] NOT NULL
  • [InKnuth] [bit] NOT NULL
  • [InEnglex] [bit] NOT NULL
  • [InShakespeare] [bit] NOT NULL
  • [InPocket] [bit] NOT NULL
  • [InUUNet] [bit] NOT NULL
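
The post does not spell out how [SoundexGroup] and [SoundexValue] were derived, but their types (a single character and a small integer) suggest the two halves of a standard four-character Soundex code such as R163. The Python sketch below works under that assumption. The helper names `soundex` and `split_soundex` are hypothetical, and SQL Server's built-in SOUNDEX may differ from this textbook version in edge cases.

```python
# Standard American Soundex: digit assigned to each coded consonant.
_CODES = {}
for _digit, _letters in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for _letter in _letters:
        _CODES[_letter] = str(_digit)


def soundex(word: str) -> str:
    """Return the four-character Soundex code (letter + three digits)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        raise ValueError("word contains no ASCII letters")
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            # H and W do not separate consonants that share a code.
            continue
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        # Vowels (and Y) reset prev, so a repeated code after them counts again.
        prev = code
    return first + "".join(digits)[:3].ljust(3, "0")


def split_soundex(word: str) -> tuple[str, int]:
    """Assumed mapping onto the table: (SoundexGroup, SoundexValue)."""
    code = soundex(word)
    return code[0], int(code[1:])


print(split_soundex("Robert"))  # ('R', 163)
```

Storing the numeric part as an integer would fit a smallint column and lets a query compare it directly instead of parsing a string.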

The [WordGuid] is actually the MD5 hash of the [WordText] expressed as a UNIQUEIDENTIFIER, so it makes a nice universal primary key. I've precomputed the [WordLength], [IsPalindrome] and a couple of Soundex values to make querying the table a bit more efficient. I've also computed a [GroupId] for each word. All words that share a [GroupId] are composed of exactly the same letters in different orders, so you could, for example, find all the whole-word anagrams of a given word using its [GroupId]. Finally, I've created a handful of [In*] flags to tell me which word file(s) each word was sourced from. I've made the database available in two forms below:
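
For anyone who wants to recompute these columns for new words, here is a minimal Python sketch of the ideas described above. It is not the code used to build the database. The text encoding that was hashed and the byte order of the resulting GUID are not stated in the post, so both are exposed as parameters with assumed defaults (UTF-16LE, matching how SQL Server stores nvarchar, and SQL Server's mixed-endian GUID layout). All function names are hypothetical, and the group numbers it produces are arbitrary, so they will not match the [GroupId] values in the download.

```python
import hashlib
import uuid
from collections import defaultdict


def word_guid(text, encoding="utf-16-le", sql_server_byte_order=True):
    """MD5 of the word, reinterpreted as a 16-byte GUID.

    The encoding and byte order are assumptions, not taken from the post.
    """
    digest = hashlib.md5(text.encode(encoding)).digest()
    if sql_server_byte_order:
        # SQL Server displays the first three GUID fields little-endian.
        return uuid.UUID(bytes_le=digest)
    return uuid.UUID(bytes=digest)


def is_palindrome(text):
    """True if the word reads the same in both directions, ignoring case."""
    folded = text.lower()
    return folded == folded[::-1]


def anagram_key(text):
    """Words made of exactly the same letters share this key."""
    return "".join(sorted(text.lower()))


def assign_group_ids(words):
    """Map each word to a group number; anagrams get the same number."""
    by_key = defaultdict(list)
    for word in words:
        by_key[anagram_key(word)].append(word)
    group_ids = {}
    for group_id, key in enumerate(sorted(by_key), start=1):
        for word in by_key[key]:
            group_ids[word] = group_id
    return group_ids


ids = assign_group_ids(["listen", "silent", "enlist", "google"])
print(ids["listen"] == ids["silent"] == ids["enlist"])  # True
print(ids["listen"] == ids["google"])                   # False
print(len("listen"), is_palindrome("level"))            # 6 True
```

Sorting the letters of each word gives a key that is identical for every anagram of that word, which is why a single integer per key is enough to look up whole-word anagrams.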

Attachable (as MDF/LDF) Microsoft SQL Server 2005 Database (21.20 MB)

Tab-delimited CSV File with Table Creation Script (11.10 MB)
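
If you take the tab-delimited file rather than the attachable database, an anagram lookup by [GroupId] might look like the Python sketch below. It assumes the file has a header row with WordText and GroupId columns and that a NULL [GroupId] appears as an empty field. Neither is confirmed here, so check the table creation script first. The function name `find_anagrams` is hypothetical and the sample rows are invented purely for illustration.

```python
import csv
import io


def find_anagrams(tsv_file, word):
    """Return the other words that share the given word's GroupId."""
    rows = list(csv.DictReader(tsv_file, delimiter="\t"))
    target = next((r for r in rows if r["WordText"] == word), None)
    if target is None or not target["GroupId"]:
        return []  # unknown word, or no group recorded for it
    return [
        r["WordText"]
        for r in rows
        if r["GroupId"] == target["GroupId"] and r["WordText"] != word
    ]


# Invented sample data standing in for the real file (assumed layout).
sample = io.StringIO(
    "WordText\tGroupId\n"
    "listen\t7\n"
    "silent\t7\n"
    "enlist\t7\n"
    "google\t9\n"
)
print(find_anagrams(sample, "listen"))  # ['silent', 'enlist']
```

If the real file has no header row, `csv.DictReader` also accepts a `fieldnames` argument listing the columns in order.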

Please see the licenses in the files at the source web site listed at the top of this post. All of the licenses are academic and free for use, but your company may want to read and catalog them for full compliance.

Enjoy!


Categories: Fun
Posted by kevin on Saturday, April 04, 2009 8:47 PM

Comments

Nikos United Kingdom

Friday, May 08, 2009 11:17 AM

this was great help thanks
