Select your font size 
 
about us products & services consulting & support news & events contact us
Paul Meagher shows how to use a database query to calculate conditional probability.

Conditional Probability and SQL - USA

print this article 
 

P(A | B) can be mapped onto database-query operations. For example, the probability of cancer given a positive test result, P(+cancer | +test), can be obtained by issuing this SQL query then doing some tallies on the result set like this:

SELECT cancer_status FROM Data WHERE test_status='+test'

If I gather information about how several boolean-valued tests co-vary with a boolean-valued diagnosis (like that of cancer or not cancer), then I can perform slightly more complex queries to study how diagnostically useful other factors are in determining whether a patient has cancer, such as in the following:

SELECT cancer_status FROM Data WHERE genetic_status='+' AND age_status='+' AND biopsy_status='+'

In the case of detecting e-mail spam, I might be interested in computing P(+spam | title_word='viagra' AND title_word='free'), which could be viewed as a directive to issue the following SQL query:

SELECT spam_status FROM Emails WHERE email_title LIKE 'viagra' AND email_title LIKE 'free' 

After enumerating the number of e-mails that are spam and have "viagra" and "free" in the title (like so):

count_emails(spam_status='+spam' AND email_title LIKE 'viagra' AND email_title LIKE 'free')

and dividing by the overall number of e-mails with the words "viagra" and "free" in the title:

count_emails(email_title LIKE 'viagra' AND email_title LIKE 'free')

I might arrive at the conclusion that the appearence of these words in the title strongly and specifically co-varies with the message being spam (after all, 18/18 = 100 percent) and this rule might be used to automatically filter such messages.

In Bayes spam filtering, you need to initially train the software in which e-mails are spam and which are not. One can imagine storing spam_status information with each e-mail record (for example, email_id, spam_status, email_title, or email_message) and doing the previous queries and counts on this data to decide whether to forward a new e-mail into your inbox.



Page:   1  2  3  4  5  6  7  8  9  10  11 Next Page: Frequency versus probability format

The content shown in this page was first published by IBM developerWorks and is reprinted with permission from Paul Meagher (www.datavore.com)


Most Recent Website and Regional Updates

 Remote Technical Support
Remote technical support is provided 24/7 by Transparen's staff, not only for data recovery, but also for proactive maintenance (i.e. detecting problems early, before they become disasters).

 
 Online Magazine Sales and Distribution System
Turnkey publishing system facilitates easy online recurring billing, content delivery. Publish using the same software as you use for the print edition.

 
 How to make a page that silently queries the server
Using IFrames and some JavaScript, your website can dynamically consult a database without needing to reload the page.

 
 Research Tools
Measure human resource allocation and collect data with the goal of determining patterns that will bring forward actionable insights which may lead to policy changes, saving money and improving quality of service.

 
 Process Evaluation Questions
Questions to help focus discussion about process improvement

 

Google
 
Web transparen.com

Contact Information

Related Information

 
  ALABAMA AL
ALASKA AK
AMERICAN SAMOA AS
ARIZONA AZ
ARKANSAS AR
CALIFORNIA CA
COLORADO CO
CONNECTICUT CT
DELAWARE DE
DISTRICT OF COLUMBIA DC
FEDERATED STATES OF MICRONESIA FM
FLORIDA FL
GEORGIA GA
GUAM GU
HAWAII HI
IDAHO ID
ILLINOIS IL
INDIANA IN
IOWA IA
KANSAS KS
KENTUCKY KY
LOUISIANA LA
MAINE ME
MARSHALL ISLANDS MH
MARYLAND MD
MASSACHUSETTS MA
MICHIGAN MI
MINNESOTA MN
MISSISSIPPI MS
MISSOURI MO
MONTANA MT
NEBRASKA NE
NEVADA NV
NEW HAMPSHIRE NH
NEW JERSEY NJ
NEW MEXICO NM
NEW YORK NY
NORTH CAROLINA NC
NORTH DAKOTA ND
NORTHERN MARIANA ISLANDS MP
OHIO OH
OKLAHOMA OK
OREGON OR
PALAU PW
PENNSYLVANIA PA
PUERTO RICO PR
RHODE ISLAND RI
SOUTH CAROLINA SC
SOUTH DAKOTA SD
TENNESSEE TN
TEXAS TX
UTAH UT
VERMONT VT
VIRGIN ISLANDS VI
VIRGINIA VA
WASHINGTON WA
WEST VIRGINIA WV
WISCONSIN WI
WYOMING WY
 
 
E C M | © 2003-2007 Transparen Corp.      

Standardized Services: Data Recovery Service / Creative Services / Premium Web Hosting Services / System Administration Tech Support Services
Recent Projects: Full-Service Mortgage and Financing Company / System to manage flights from Vancouver to Tofino / Photo exchange verification service
Our Vancouver BC Server Proudly Hosts: automated parking and revenue control systems, leafside lane at southlands, cost effective alternative power sources, Higher Grade Learning Centres, pacific forage bag supply, sunburst medical, neosonic design, roger mahler photography - passionate, intriguing, desirable, the connection between east and west, affordable flights to victoria and tofino, low interest mortgage brokers in vancouver, richmond, surrey, toronto, Toronto Calgary and Vancouver IT staffing and talent search
Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, Florida, Georgia, Hawaii, Idaho State, Illinois, Indiana, Iowa, Kansas, Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Montana, Nebraska, Nevada, New Hampshire, New Jersey, New Mexico, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Carolina, South Dakota, Tennessee, Texas, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin, Wyoming