Elance Data Mining Test Answers 2015



Which industry can benefit from data mining?
All of these
Retail
Manufacturing
Finance/Banking


A function used by a node in a neural net to transform input data from any domain of values into a finite range of values is known as a(n):
Confusion matrix
Activation Function
Chi-square
Antecedent
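
Illustration for the question above: a minimal sketch of one such function, the logistic sigmoid, which squashes any real input into the range 0 to 1. The function name and the sample inputs are chosen here for illustration only.

```python
import math

def logistic(x: float) -> float:
    """Map any real number into the range 0..1 (logistic sigmoid)."""
    # Branch on the sign so math.exp never receives a large positive
    # argument, which would raise OverflowError.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

# Inputs from a very wide domain all land inside the same finite range.
for x in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    print(x, logistic(x))
```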


Changes to parts of a code could lead to the problem of ______________ data.
granular
inconsistent
nonintegrated
dirty


What is data visualization?
A structured and developed prediction of data results
The visual interpretation of complex relationships in multidimensional data
The technical term for the act of data being stored in a server


Decision trees are able to handle missing values without using any impute transformation. True or False?
True
False


Data items grouped into relationships and preferences are known as:
Functional Organizations
Predictable Sets
Degrees of Fit
Clusters


Which of the following is valid XML?
All are valid
<valid>This One</valid>
<valid>"This One"</valid>
<body answer="valid">This One</body>
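
Illustration for the question above: a minimal sketch that checks each candidate with Python's standard XML parser. Strictly speaking this tests well-formedness; validity against a DTD or schema would need one to be supplied, which the question does not do. The helper name is hypothetical.

```python
import xml.etree.ElementTree as ET

candidates = [
    '<valid>This One</valid>',
    '<valid>"This One"</valid>',
    '<body answer="valid">This One</body>',
]

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as a complete XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

results = [is_well_formed(c) for c in candidates]
print(results)
```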


A(n) _____ algorithm creates rules that describe how often events have occurred together.
associative
artificial
pruning
CHAID


What are decision trees?
Complex reports generated by a qualified data scientist
Structures that generate rules for the classification of a dataset
Hierarchical dimensions that can be created with a hyper cube browser
Data not collected by the organization, such as data available from a reference book


Which of the following is not a relational database?
Google Bigtable
MongoDB
All of the above
Apache Cassandra


In predictive models, the values or classes to be predicted are called the:
Response
Target variables
Dependent
All of these


In a neural net, to what does topology refer?
The number of layers and the number of nodes in each layer
The graphical visualization of the data
The range of variables in a set
The number of nodes utilized


You are a credit risk manager at a retail bank. Some information about customers is available for analysis. Based on this data, you have to decide whether a person will be a good or a bad customer. Choose the appropriate data mining task for this business problem.
Segmentation
Classification
Regression


Which of the following clustering algorithms can find clusters of arbitrary shape?
None of these
Single-Link
DBSCAN
Both of these


True or False? A loose-coupling data mining architecture is mainly for memory-based data mining systems that do not require high scalability and high performance.
True
False


What is CRISP-DM?
Microsoft's linear regression algorithm
A decision tree developed in the 1980s but almost entirely replaced by the CART method today
A six phase method for predicting e-commerce buying habits
A cross-industry standard process for data mining


The annual revenue of an international company is correlated with other attributes such as advertising spend, exchange rate, inflation rate, etc. Given these values (or reliable estimates of them for the next year), the company has to calculate its expected revenue for the next year. Choose the appropriate data mining task for this business problem.
Regression
Classification
Segmentation


Which of these is NOT considered an internal data factor?
Product Positioning
Economic downturns
Staff Skills
Price


With which of these layers does a neural network start?
Transparent layer
Output Layer
Hidden Layer
Input layer


What is the measure of how much two random variables change together?
covariance
stochastic inertia
polyconvergence
binary standard deviation
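
Illustration for the question above: a minimal sketch of the sample covariance of two equal-length sequences. The function name and the sample data are made up for illustration.

```python
def sample_covariance(xs, ys):
    """Sample covariance: average product of paired deviations from the means."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Divide by n - 1 for the unbiased sample estimate.
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

xs = [1, 2, 3, 4]
rising = [2, 4, 6, 8]    # moves with xs, so the covariance is positive
falling = [8, 6, 4, 2]   # moves against xs, so the covariance is negative
print(sample_covariance(xs, rising), sample_covariance(xs, falling))
```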


A hyperplane is a
decision boundary separating classes of data
non-terminating error condition
collection of linked hypertext files
variant of the C4.5 algorithm


Suppose that the company's marketing department collects data from customers. It wants to form customer groups so that each offer can be targeted at the most appropriate group. Choose the appropriate data mining task for this business problem.
Classification
Regression
Segmentation


The level of the model that specifies (often graphically) which variables are locally dependent on each other.
Qualitative Level
Structural Level
Primary Level
Quantitative Level


To increase the confidence of your state of classification performance on the entire population, you should:
Decrease the size of the training dataset
Increase the size of the test dataset
Decrease the size of the test dataset
Increase the size of the training dataset


Data not collected by the organization, such as data from a proprietary database, that is combined with the organization’s own data is known as:
Noise
Non-applicable data
Overlay
Overfitting


Which data mining technique organizes sets of data into predefined groups?
Classification
Sequential Patterning
Clustering
Gamification


What is the front end layer of data mining architecture?
Firewalls established to protect data from malicious sources
The team of programmers who designed the software utilized in a particular mining project
An intuitive and user friendly user interface
The hardware designed specifically for storage of massive amounts of data


Which of these is an example of a sequential pattern relationship?
Using business experience and gut instinct to design a new floorplan in a grocery store
Reorganizing your basketball team's starting lineup based on an analysis of performance
Placing two frequently purchased items next to each other on the shelf
Predicting the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes


In the analysis of time-series data, the mean value over a given time period (usually some interval in the past up to the present) is called a(n)
unbiased mean
compounded mean
partial average
moving average
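
Illustration for the question above: a minimal sketch of a simple (unweighted) moving average over a fixed window. The function name, window size and series are illustrative assumptions.

```python
from collections import deque

def moving_average(values, window):
    """Mean of the most recent `window` values, emitted once the window is full."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = deque(maxlen=window)
    total = 0.0
    averages = []
    for value in values:
        if len(recent) == window:
            total -= recent[0]   # the oldest value is about to be evicted
        recent.append(value)
        total += value
        if len(recent) == window:
            averages.append(total / window)
    return averages

print(moving_average([1, 2, 3, 4, 5], 3))
```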


True or False? Tests in CART are always binary.
False
True


In the association between two variables, what is the difference between the antecedent and the consequent?
The antecedent is on the left, the consequent on the right
Nothing, they are interchangeable
The antecedent is on the right, the consequent is on the left.
The antecedent is always a very complex variable


The algorithm powering the Google search engine is:
The Brin-Page Method
PageRank
GoogleCrawler
AdaBoost


What is Dependency Modeling?
A task which consists of techniques for estimating, from data, the joint multi-variate probability density function of all of the variables/fields in the database.
Learning a function that maps a data item into one of several predefined groups or clusters.
A multi-step process involving data preparation, pattern searching, knowledge evaluation, and refinement with iteration after modification.
The process of finding a model which describes significant dependencies between variables


What is Regression?
Learning a function that maps a data item to a real-valued prediction variable.
A descriptive task where one seeks to identify a finite set of categories to describe the data.
An expression E in a language L describing facts in a subset FE of F.
Learning a function that maps a data item into one of several predefined groups.


What is Change and Deviation Detection?
A task focusing on discovering the most significant changes in the data from previously measured or normative values
The process of finding a model which describes significant dependencies between variables
A task which consists of techniques for estimating, from data, the joint multi-variate probability density function of all of the variables/fields in the database.
Methods for finding a compact description for a subset of data.


Sharding refers to:
a measure of the noise in a database's contents
none of the above
simultaneously accessing multiple object databases over SSH
partitioning a database for distribution across different servers


Which of these is NOT a common description of layers?
Functional
Output
Hidden
Input


What is the type of data mining that drives the Amazon.com recommendation system?
Clustering Algorithms
Anomaly Detection
Association Learning
Fuzzy Logic


Support Vector Machines have an advantage over Neural Networks because SVMs are
parametric
easier to train via online learning
none of the above
more resistant to local minima convergence


True or False? Economic indicators are external data factors.
False
True


Which of these is NOT a type of analytical software?
Neural network
Machine learning
All are valid types
Statistical


Which of the following algorithms is generally suitable for unsupervised learning tasks?
k-nearest neighbor
info-fuzzy networks
k-means algorithm
Restricted Boltzmann machine


Which of the following storage solutions is most appropriate for a semi-structured dataset whose members do not all have the same attributes?
MariaDB
SQLite
MySQL
MongoDB


In order to estimate classification performance on an entire population, you need _______
(None of these)
disjoint training and test datasets
Test Datasets
Disjoint training


What is the extraction of useful if-then rules from data based on statistical significance?
Preliminary Method Mapping
Dynamic Information Inference
Fuzzy Logic Application
Rule Induction


What is a KDD Process?
K-mean Data Discovery
Knoop-hardness measured through high-impact dimension
Differential Decryption
Knowledge Discovery in Databases


Generalization error is a consequence of
Overfit
Underfit
Poorly defined Chernoff Bound
Parametric analysis


Which of the following is NOT a common source system?
DB Connect
UDC
SAP source
Node


Which of these are evolutionary computational methods?
Clustering algorithms
Bayesian inference algorithms
Genetic algorithms
Heuristic algorithms


True or False? The MARS algorithm cannot produce rules.
True
False


A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset is:
Nearest Neighbor
Decision Treeing
Logistic Regression
Association Model Query
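
Illustration for the question above: a minimal sketch of the technique described, using Euclidean distance and a majority vote among the k most similar historical records. The record layout, labels and k are assumptions made for this example.

```python
from collections import Counter
import math

def knn_classify(history, query, k=3):
    """Label `query` by majority vote among its k closest historical records."""
    # history is a list of (feature_tuple, label) pairs.
    nearest = sorted(history, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

history = [
    ((1.0, 1.0), "good"),
    ((1.2, 0.8), "good"),
    ((0.9, 1.1), "good"),
    ((5.0, 5.0), "bad"),
    ((5.2, 4.9), "bad"),
]
print(knn_classify(history, (1.1, 1.0)), knn_classify(history, (5.1, 5.0)))
```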


Which of the following is most appropriate for finding the shortest chain of friends linking two people in a social graph who are not friends with each other?
Neural Networks
k-means algorithm
Dijkstra's algorithm
Markov chains
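
Illustration for the question above: a minimal sketch of Dijkstra's algorithm on a small friendship graph in which every edge has weight 1. With unit weights a plain breadth-first search would return the same chain length. The graph and names are invented for this example.

```python
import heapq

def shortest_chain(graph, start, goal):
    """Return one shortest path from start to goal, or None if unreachable."""
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist[node]:
            continue  # stale heap entry, a shorter route was already found
        for friend in graph.get(node, ()):
            candidate = d + 1  # every friendship counts as one hop
            if candidate < dist.get(friend, float("inf")):
                dist[friend] = candidate
                prev[friend] = node
                heapq.heappush(heap, (candidate, friend))
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]

friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann", "dan", "eve"],
    "dan": ["bob", "cat", "fay"],
    "eve": ["cat", "fay"],
    "fay": ["dan", "eve"],
    "gus": [],
}
chain = shortest_chain(friends, "ann", "fay")
print(chain)
```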


Which of the following is NOT a function of data warehouses?
Extracting data
Cleaning data
Cleaning dirty data
Storing purchased data


What is Interestingness?
A discovered pattern that is true on new data with some degree of certainty, and generalizes to other data.
An overall measure of pattern value, combining validity, novelty, usefulness, and simplicity.
An expression E in a language L describing facts in a subset FE of F.
A multi-step process involving data preparation, pattern searching, knowledge evaluation, and refinement with iteration after modification.


What is a genetic algorithm?
A classic algorithm for frequent item set mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database.
An algorithm that estimates how well a particular pattern (a model and its parameters) meet the criteria of the KDD process. Evaluation of predictive accuracy (validity) is based on cross validation. Evaluation of descriptive quality involves predictive accuracy, novelty, utility, and understandability of the fitted model. Both logical and statistical criteria can be used for model evaluation.
A search algorithm that enables us to locate optimal binary string by processing an initial random population of binary strings by performing operations such as artificial mutation, crossover and selection.


In the MapReduce model, Map and Reduce functions act directly on which kind of data structure?
MySQL matrices
key-value pair
linked lists
relational databases
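
Illustration for the question above: a minimal single-process sketch of the MapReduce data flow for a word count. It only imitates the model in plain Python and is not Hadoop's API; all function names are hypothetical.

```python
from collections import defaultdict

def map_fn(_key, line):
    """Map: take one (key, value) pair and emit intermediate (word, 1) pairs."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: take one key with all of its values and emit a (word, total) pair."""
    yield word, sum(counts)

def run_word_count(lines):
    # Shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in map_fn(offset, line):
            groups[key].append(value)
    # Sort keys, then reduce each group.
    result = {}
    for key in sorted(groups):
        for out_key, out_value in reduce_fn(key, groups[key]):
            result[out_key] = out_value
    return result

print(run_word_count(["big data big", "data mining"]))
```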


Which of the following is not a common goal of the KDD Process:
Description
Performance
Prediction


Which of the following disciplines overlaps Data Mining?
Artificial Intelligence
Statistics
All of the above
Linguistics


What is Classification?
Learning a function that maps a data item into one of several predefined groups.
A descriptive task where one seeks to identify a finite set of categories to describe the data.
Methods for finding a compact description for a subset of data.
A discovered pattern that is true on new data with some degree of certainty, and generalizes to other data.


Which of the following clustering algorithms can optimize an objective function?
DBSCAN and Single Link
k-means only
k-means and CLARANS
Subspace Clustering Algorithms


What is Clustering?
Learning a function that maps a data item into one of several predefined groups or clusters.
A descriptive task where one seeks to identify a finite set of categories to describe the data.
A task which consists of techniques for estimating, from data, the joint multi-variate probability density function of all of the variables/fields in the database.
The process of finding a model which describes significant dependencies between variables


Which of the following properties applies to Single-Layer Perceptrons?
backpropagation
continuous output
random initialization of weights
able to learn non-linear separations


In Natural Language Processing, what is the role of a lexical analyzer?
generates a context-free grammar
splits the stream of input characters into tokens
processes the parse tree for semantic meaning
checks the validity of a token
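
Illustration for the question above: a minimal sketch of a lexical analyzer built on a regular expression. The three token classes (NUMBER, WORD, PUNCT) are assumptions chosen for this example; real tokenizers use richer rules.

```python
import re

# One named group per token class; whitespace matches nothing and is skipped.
TOKEN_RE = re.compile(
    r"(?P<NUMBER>\d+(?:\.\d+)?)|(?P<WORD>[A-Za-z]+)|(?P<PUNCT>[^\sA-Za-z0-9])"
)

def tokenize(text):
    """Split a character stream into (token_type, token_text) pairs."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

print(tokenize("Sales rose 3.5 percent."))
```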


A DBMS reduces data redundancy and inconsistency by
Enforcing referential integrity
Utilizing a data dictionary
Minimizing isolated files with repeated data
uncoupling program and data


In which type of analysis is a Kohonen feature map typically employed?
Predictive analysis
Descriptive modeling analysis
Cluster analysis
Exploratory data analysis


Converted information to provide insights about historical patterns and future trends is known as:
Meta-data
Clustering
Knowledge
Linear regression


What is Summarization?
A task focusing on discovering the most significant changes in the data from previously measured or normative values
A descriptive task where one seeks to identify a finite set of categories to describe the data.
The process of finding a model which describes significant dependencies between variables
Methods for finding a compact description for a subset of data.


Which of the following is NOT a method of combining multiple models into an ensemble model?
Bootstrapping
Averaging
Stacking
Voting


"In 2% of the purchases at the hardware store, both a pick and a shovel were bought," is an example of:
Support
Topology
Supervised learning
Validation
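
Illustration for the question above: a minimal sketch that computes the fraction of transactions containing every item of an itemset. The 50 transactions are invented so that exactly one of them contains both a pick and a shovel.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)

# Hypothetical hardware-store data: 1 of 50 purchases has both items.
purchases = (
    [{"pick", "shovel"}]
    + [{"nails"}] * 30
    + [{"pick"}] * 10
    + [{"shovel"}] * 9
)
print(support(purchases, {"pick", "shovel"}))
```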


Which of the following applications are usually used to classify students' performances?
Regression analysis
If...then... analysis
Cluster analysis
Market-basket analysis


Which of the following properties is a constraint on a RESTful application?
returns JSON output
stateful
linearly separable
stateless


Which of the following algorithms produces decision trees?
DBSCAN
ID3
logistic regression
none of the above


The component of the Hadoop Distributed Filesystem responsible for storing metadata is called the
FS Shell
Datanode
Namenode
DFSAdmin


Which XPath selector expression captures all link elements of the form 'http://example.com/profile/12345' in an HTML page while excluding all links of the form 'http://example.com/casenumber/12345'?
//href/profile
//a/profile
//a[contains(@href, "profile")]/@href
//a[contains(@href, "profile")]


What is Pig?
A programming language that enables Hadoop to operate as a data warehouse.
A programming language that simplifies the common tasks of working with Hadoop.
None of these


If more than one value occurs the same number of times, the data is:
Multi-leafed
Multi-modal
Multi-faceted
Multivariated
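
Illustration for the question above: a minimal sketch using the standard library's statistics.multimode (Python 3.8+), which returns every value tied for the highest frequency. The sample data is made up.

```python
from statistics import multimode

ratings = [1, 2, 2, 3, 3, 4]

# multimode lists all most-frequent values in order of first appearance.
modes = multimode(ratings)
has_several_modes = len(modes) > 1
print(modes, has_several_modes)
```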


Taking multiple random samples of data and building a classification model for each is known as:
Clustering
Boosting
Binning
Fuzzy Sampling


A commonly used continuous alternative to the step function in multi-layered neural network output is the
logistic function
hyperbolic function
logarithmic function
multi-layered NN cannot compute continuous output


The authentication protocol used by many significant web APIs is called:
SSL
PGP
OAuth
HTTPS


What is CURL?
A methodology for classifying hidden features of data
The part of HTTP that specifies access permission
Combinatorial Unsupervised Recursive Learning algorithm
A command-line tool for retrieving files


Apriori is a seminal algorithm for finding frequent item sets using:
Candidate generation
Normal mixture models
None of these
Overfitting methods


What is the first step in the business understanding phase?
Firmly grasp business objectives and needs
Create data mining goals to achieve the business objectives
Create a list of all relevant algorithms to be applied to the task
Assess the current situation by finding out the resources, assumptions, constraints etc.


In any numerical data set with a meaningful mean value, what is the minimum fraction of data that will fall within n standard deviations of the mean?
1/n^2
1/n
1-1/n^2
1/2n
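
Illustration for the question above: Chebyshev's inequality states that at least 1 - 1/n^2 of the values lie within n standard deviations of the mean, whatever the shape of the distribution. The sketch below checks this bound empirically on a deliberately skewed sample; the seed, sample size and distribution are arbitrary choices.

```python
import random
import statistics

def fraction_within(data, n):
    """Fraction of values within n population standard deviations of the mean."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    inside = sum(1 for x in data if abs(x - mu) <= n * sigma)
    return inside / len(data)

rng = random.Random(0)  # fixed seed so the run is repeatable
data = [rng.expovariate(1.0) for _ in range(1000)]  # skewed, non-normal sample

for n in (2, 3, 4):
    bound = 1 - 1 / n**2
    print(n, fraction_within(data, n), ">=", bound)
```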


Which of these is a possible architecture of a data mining system?
Transitive coupling
Magnetic coupling
No-coupling
Quickstart coupling


Which of these is not a step in the KDD process?
Data Mining
Data Cleaning
Data Integration
Data Quantification


Which of the following methods can be used for modeling a categorical target variable?
Regression
Non-Linear Regression
All of the Above
ARIMA
Logistic Regression


The level of the model that specifies the strengths of the dependencies using some numerical scale.
Quantitative Level
Numeric Level
Primary Level
Dependency Level


Which of the following is not a primary phase of a Hadoop Reducer?
Reduce
Map
Sort
Shuffle


The measured differences between a model and its predictions are known as:
Non-applicable data
Noise
Range
Outliers


True or False? Artificial neural networks are linear predictive models.
False
True


Which are popular data mining methods?
Probabilistic Graphical Dependency Models
Decision Trees and Rules
Relational Learning Models
All of these


Which decision tree method performs multi-level splits when computing classification trees?
ID3 (Iterative Dichotomiser 3)
CHAID (Chi Square Automatic Interaction Detection)
C4.5 algorithm
CART (Classification and Regression Trees)


What is the advantage of the k-Medoids Clustering Algorithm over the k-Means Clustering (Lloyd's) Algorithm?
represents clusters by center
more resistant to outliers
all of the above
uses iterative refinement


Which of the following is not an appropriate tool for harvesting data from a website that accesses its database through Javascript/AJAX calls?
wget
PhantomJS
Selenium
All of the above are appropriate


Which of the following is part of a retail customer data mining strategy?
loyalty cards
holiday sale
customer testimonials
money-back guarantee


Hash-based technique, Transaction Reduction, Partitioning, Sampling, and Dynamic Itemset Counting are all examples of what?
Techniques to improve the efficiency of an Apriori algorithm
Methods to repeatedly scan the database and check a large set of candidates by pattern matching.
Methods of generating frequent item sets without candidate generation.
Methods for finding a compact description for a subset of data.


Which of the following is not valid JSON?
All are valid
{"answer": "this one"}
{["answer": "this one"]}
{"answer": ["this one"]}
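
Illustration for the question above: a minimal sketch that feeds each candidate to Python's standard JSON parser and records whether it is accepted. The helper name is hypothetical.

```python
import json

candidates = [
    '{"answer": "this one"}',
    '{["answer": "this one"]}',
    '{"answer": ["this one"]}',
]

def is_valid_json(text: str) -> bool:
    """Return True if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

results = [is_valid_json(c) for c in candidates]
print(results)
```

An object must consist of "key": value pairs, so an object whose first member is a bare array is rejected by the parser.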


The two major functions of BI servers are:
Management and delivery
Processing and management
Application and delivery
Source and results


How do you measure interestingness in association patterns?
measure accuracy
measure relevance
measure variance
measure lift
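
Illustration for the question above: a minimal sketch of lift for a rule A -> B, computed as support(A and B) / (support(A) * support(B)). A value above 1 means the items occur together more often than independence would predict. The baskets are invented for this example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in transactions if itemset <= basket) / len(transactions)

def lift(transactions, antecedent, consequent):
    """Lift of antecedent -> consequent; assumes both have non-zero support."""
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / (support(transactions, antecedent) * support(transactions, consequent))

# Hypothetical data: 10 baskets, 4 with both items, 1 with each item alone.
baskets = (
    [{"bread", "butter"}] * 4
    + [{"bread"}]
    + [{"butter"}]
    + [{"milk"}] * 4
)
value = lift(baskets, {"bread"}, {"butter"})
print(value)
```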


A descriptive approach to exploring data that can help identify relationships among values in a database is:
Clustering
Predictive analysis
Link analysis
Function activation


Where can a website operator generally find data on her customers' IP addresses?
cookies
all of the above
HTTP request headers
server logfiles


Data mining provides a link between:
Online analytical processing and dynamic information
Genetic algorithms and logistic regression
Parallel processing and RAID
Separate transactional and analytical systems


What is Hive?
Both of these
Hive is a programming language that simplifies the common tasks of working with Hadoop.
Hive enables Hadoop to operate as a data warehouse.


What is the purpose of the Hadoop Distributed File System (HDFS)?
Creating a context in which there are no restrictions on the data, enabling it to be unstructured and schemaless.
To enable computation to take place by allowing each server to have access to the data.
Ensuring that data is replicated with redundancy across the cluster.
All of these.


The silhouette coefficient can be used to determine the natural number of clusters for ________.
Hierarchical Algorithms
Partitioning Algorithms
Density Based Algorithms
Subspace Clustering Algorithms