data science – Information Science

Databases are used to store all sorts of information across many different fields. The health industry uses databases to solve a multitude of problems through research and analysis of the data. One type of database used in the health industry is the genetic database. These databases contain information about an organism’s genes, DNA variants, and other genetic information. With access to the genetic information of thousands of different organisms, scientists can look for patterns in DNA to infer which genes might be responsible for certain effects in the organism. This method of analysis is helping scientists determine the cause of some of the world’s biggest biological problems, including the father of them all: cancer.

There is an extremely interesting project going on in the health industry called TACCO, which stands for Transcriptome Alterations in Cancer Omnibus. This is an example of a genetic database, containing information on altered cancer genes. By analyzing the empirical data stored in TACCO, researchers can draw conclusions about risk levels for different types of cancer and estimate the probability that a person with certain genes will develop a specific type of cancer. For example, if the database contains one thousand instances of a person with a modified gene type A, and ten of those people developed brain cancer, then a link between that gene type and brain cancer could be drawn. If a patient with the same modified gene type A is seen by a doctor, the doctor could use these statistics to estimate that this person has roughly a 1% chance (10 out of 1,000) of developing brain cancer. The data in this database can also be used to estimate survival rates of people with certain types of cancer based on their genetic makeup. There might be an extremely low survival rate for a specific type of cancer, let’s say 2%. However, by analyzing the data in the database, researchers could find that every person who had gene type B survived this cancer. This means that if a person with this gene type is diagnosed with this specific cancer, their expected survival rate would be above the normal 2%. Even more crucially, this would allow scientists to study this specific gene and figure out what it is about it that allows the people who have it to survive. This research could ultimately result in a cure for that specific type of cancer.
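The arithmetic behind the two examples above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are made up to match the numbers in the text, they are not real TACCO data, and the function name `outcome_rate` is hypothetical.

```python
# Hypothetical records, invented to match the numbers in the text.
# Each record is (gene_type, outcome), where outcome is True or False.

# 1,000 people with modified gene type A; 10 of them developed brain cancer.
risk_records = [("A", True)] * 10 + [("A", False)] * 990

# 100 patients with one specific cancer; only the 2 patients with gene type B survived.
survival_records = [("B", True)] * 2 + [("other", False)] * 98


def outcome_rate(records, gene_type=None):
    """Return the share of records whose outcome is True.

    If gene_type is given, only records with that gene type are counted.
    """
    outcomes = [outcome for gene, outcome in records
                if gene_type is None or gene == gene_type]
    if not outcomes:
        raise ValueError("no records match the requested gene type")
    return sum(outcomes) / len(outcomes)


risk_a = outcome_rate(risk_records, "A")              # 10 / 1000
overall_survival = outcome_rate(survival_records)     # 2 / 100
survival_b = outcome_rate(survival_records, "B")      # 2 / 2

print(f"Risk of brain cancer with gene type A: {risk_a:.1%}")
print(f"Overall survival rate: {overall_survival:.1%}")
print(f"Survival rate with gene type B: {survival_b:.1%}")
```

A rate computed from only a handful of patients, like the gene type B group here, is a hint worth investigating rather than a firm conclusion, which is why the text talks about studying the gene further.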

The information in these genetic databases is examined in different formats to reveal patterns. While a pattern involving a certain gene may not be evident when looking at the raw records, it might surface when the data is represented graphically. The information needed to find the causes and cures of Earth’s biggest diseases might very well be there in these already existing databases. Now researchers and scientists must deeply analyze this data in different ways to extract the patterns hidden among the lines of data. We are getting closer every day to reaching these cures, and information and data scientists are needed more than ever to find them.

Nick Bagley

Back in 1991, Guido van Rossum introduced the world to his new programming language, Python. The language entered mainstream usage quickly, but only a few years later, Ross Ihaka and Robert Gentleman created another programming language, R. Since then, both languages have been used heavily in the data analysis field. But which language is better? Since 2013, Python has been used by nearly four times as many people as R. Python has the fourth most active usage on GitHub and Stack Overflow, while R lands at 15th. However, that does not necessarily mean that Python is a better language for data science.

R is completely centered around data and statistical analysis. Data can be analyzed in tables and manipulated with simple strings of commands. R provides its users with a plethora of base functions to extract information from data sets, and by combining these simple functions it is easy to produce a more complex command. Typically, R is not taught as a first programming language because it is known to be more difficult than languages such as Python. However, once the basic syntax is understood, it is easy to dive into everything R can do.

Another advantage that R has over Python is its code repository. R has a massive selection of packages to install, all available at CRAN, the Comprehensive R Archive Network. Python has a similar repository called PyPI, but it is not as heavily contributed to. This wide selection of packages allows R to continue to grow, while Python does not focus as much on the usage of packages.

However, even with all of these advantages, Python is beginning to rise in popularity, looking to overtake R. As seen in the diagram below, more people are switching over to Python than ever before.

The world is becoming a greater environment for engineering. It isn’t only computer scientists who know how to code now. People in all different fields use some sort of coding in their occupation. That is why the adaptability of Python is beginning to take precedence over the raw functionality of R. Python code is easy to read, which means that people in different parts of a business can understand the code, even with no real knowledge of computer science. Python can also combine data analysis with general programming better than R can. Python is much more applicable to engineering and development purposes than R is, and there is more development happening in the world than ever before. This is why Python is becoming dominant in the world of data science.
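As a small illustration of that readability, here is a sketch that mixes a bit of analysis with ordinary program logic using only Python's standard library. The sales figures and variable names are hypothetical, chosen just for this example.

```python
from statistics import mean, median

# Hypothetical monthly sales figures (units sold); illustrative only.
monthly_sales = {"Jan": 120, "Feb": 95, "Mar": 140, "Apr": 110}

# The analysis part: summary statistics from the standard library.
units = list(monthly_sales.values())
average = mean(units)
middle = median(units)

# The programming part: plain logic that sits right next to the analysis.
strong_months = [month for month, sold in monthly_sales.items() if sold > average]

print(f"Average: {average}, median: {middle}")
print("Above-average months:", ", ".join(strong_months))
```

Even someone who has never written Python can likely follow what each line is doing, which is the point being made about sharing code across a business.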

-Nick Bagley

Genetic Databases

Python vs. R