Abstract: Query processing is a strategy for retrieving data from a database
dependably. The performance of a database system relies on the query processing
strategies that it uses. Databases must be able to answer clients' requests for
data. Large database systems may run in unpredictable and unstable
environments, where it becomes difficult to produce efficient query plans based
only on the information that is available at compile time; returning the
result in a timely manner is the task of query optimization. Efficient
processing of queries is an essential requirement in many interactive
environments that involve a large amount of information. This paper explains
the effect of query processing and optimization on the distributed database
(DDB), which requires the transmission of information between computers in a
network. The arrangement of data transmissions and local data processing is
known as a distribution strategy for a query. Two cost measures, response time
and total time, are used to judge the quality of a distribution strategy.
Moreover, different algorithms are described that derive distribution
strategies with minimal response time and minimal total time for a special
class of queries, in order to assess the performance of the DDB.
Keywords: query processing, query optimization,
In general, a database system should be able to reply to the requests of its
users. Retrieving data from a database system is the task of Query Processing,
and returning the result in a convenient time is managed by Query
Optimization. Query Processing and Query Optimization are essential parts of
an RDBMS: the result of a query should be returned to its user, such as a
person, a robotic assembly machine, or another DBMS, within the timeframe
required by that user 5. Query Processing determines the performance of the
database, while Query Optimization determines the response time of the
database system.
Furthermore, a database query is a request sent to the RDBMS to retrieve or
modify specific data. Retrieval and update are performed through different
low-level operations in the RDBMS, which can be expressed as relational
algebra operations such as project, join, select, Cartesian product, etc. 5.
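The relational algebra operations named above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: relations are modelled as lists of dictionaries, the `employees` and `departments` data are hypothetical, and a real RDBMS implements these operations very differently.

```python
from itertools import product


def select(relation, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, attributes):
    """Projection: keep only the named attributes and drop duplicate rows."""
    seen, result = set(), []
    for row in relation:
        key = tuple(row[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result


def cartesian_product(left, right):
    """Cartesian product: pair every left row with every right row.

    Assumes attribute names are distinct; a shared name keeps the right value.
    """
    return [{**l, **r} for l, r in product(left, right)]


def join(left, right, attribute):
    """Equi-join on one attribute that both relations share."""
    return [{**l, **r} for l in left for r in right if l[attribute] == r[attribute]]


# Hypothetical sample relations, not taken from the paper.
employees = [
    {"emp_id": 1, "name": "Ana", "dept_id": 10},
    {"emp_id": 2, "name": "Ben", "dept_id": 20},
    {"emp_id": 3, "name": "Cy", "dept_id": 10},
]
departments = [
    {"dept_id": 10, "dept_name": "Sales"},
    {"dept_id": 20, "dept_name": "Support"},
]

# Names of the employees who work in the Sales department.
sales_staff = project(
    select(join(employees, departments, "dept_id"),
           lambda row: row["dept_name"] == "Sales"),
    ["name"],
)
print(sales_staff)  # [{'name': 'Ana'}, {'name': 'Cy'}]
```

A query written in SQL is translated into a composition of such operations, which is what gives the optimizer room to reorder them.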
A Relational Database Management System (RDBMS) is a specific type of DBMS
that uses the relational model. It lets users store data in multiple tables
that are related to each other by common fields, and it is also the most
popular type of database system; examples include MS SQL Server, DB2, Oracle
and MySQL. A Database Management System (DBMS) stores data in a way that makes
it easier to retrieve, manipulate and produce information. It enables users to
create and manage databases, and the data can be accessed by multiple users in
different locations. It also lets users create, read, update, and delete data
in the database, and it can control how an end user views data by giving users
different permissions to access data in the database. Users of a DBMS can be
classified into three types:
Query processing and query optimization are the most important components of
an RDBMS: “these components are responsible for translating a user query,
usually written in a non-procedural language like SQL – into an efficient
query evaluation program that can be executed against the database.” (Saurabh
Gupta, Gopal Singh Tandel, Umashankar Pandey, 2015) 8.
Moreover, query processing and optimization also play an important role in the
distributed database (DDB) in terms of database performance, which is measured
by different algorithms. In a DDB, data are distributed over various sites and
are accessed through query requests; query processing and optimization choose
the best way to execute each query. In a distributed database, queries are
affected by:
The method used to insert the data into the server.
The transport time among servers.
The response time of the query depends on the transmission time between
servers 1.
This paper explains the effect of query processing and optimization on the
distributed database (DDB), including response time and transmission cost, by
describing some algorithms.
1.1 Distributed Database
A distributed database is a collection of databases that can be kept at
various sites of a computer network: “A distributed database (DDB) is a
collection of multiple, logically interrelated databases distributed over a
computer network” (Swati Gupta, Kuntal Saroha, Bhawna, 2011). A distributed
database management system (DDBMS) is software that allows the management of
the DDB and makes the distribution transparent to the users; each database may
use a different DBMS and a different architecture that distribute the
execution of procedures 10.
It also has an important function nowadays, when all sorts of users need to be
connected to a company's databases: in addition to the company's own
employees, customers, potential customers and vendors need access to the
information in the databases 9,10.
The idea of the DDB is to store data in different databases over the network
instead of keeping those data in a single database; the data are also
accessible by different users from different places 9. Moreover, people can
access those data with the help of queries 2. The processing of a distributed
query is composed of the following stages 9:
A distributed database has several benefits. A distributed database management
system (DDBMS) supports the creation and maintenance of distributed databases,
where data are kept at various sites connected through a network. An objective
of a DDBMS is to present a simple and unified interface to the users so that
they can access the databases as if there were a single database. Another
important goal of a DDBMS is to process distributed queries efficiently, in
addition to providing availability and reliability 3.
Query Processing is the activity of converting a high-level query into a
low-level language. Most of the queries submitted to the DBMS are in a
high-level language such as SQL; through the Parsing and Translation stage,
the human-readable form is converted to the form used by the DBMS, which
includes the relational algebra expression, the query tree and the query
graph 5. The query processing methods for multiple dimensions are divided into
the five steps below 7.
1. Selection Query Model.
2. Data access model.
4. Query and Data uncertainty.
5. Ranking Function.
The transformation of the high-level query into the low-level query by Query
Processing goes through various levels, as below:
Parsing and Translation: In this step a query is submitted to the DBMS in a
high-level query language such as SQL, in which the query appears as a string
or sequence of characters, and is changed into a usable internal form 5 7.
Optimization: In this step the query processor applies rules to the internal
data structure to transform it into an equivalent but more efficient
representation 5 7.
Fig. 1 Steps of processing a high-level query 6
Evaluation: The last step of processing a query. In this step the best
candidate plan generated by the optimization engine is selected and then
executed 5 7.
The figure below illustrates the steps of
Query Processing in Database 8.
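The three stages can be observed on a small scale with Python's built-in `sqlite3` module. The sketch below uses SQLite as a stand-in engine, which is an assumption and not the system discussed in the paper; the table `balance`, its columns, and its rows are hypothetical. `EXPLAIN QUERY PLAN` reports the plan that the engine's optimizer selected after parsing the SQL string, and running the query itself corresponds to the evaluation step.

```python
import sqlite3

# In-memory SQLite database with a hypothetical table and data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE balance (account_id INTEGER PRIMARY KEY, salary INTEGER)")
conn.executemany("INSERT INTO balance (salary) VALUES (?)",
                 [(2500,), (3200,), (1800,)])

query = "SELECT salary FROM balance WHERE salary < 3000"


def plan_details(connection, sql):
    """Return the human-readable detail column of SQLite's chosen plan."""
    return [row[-1] for row in connection.execute("EXPLAIN QUERY PLAN " + sql)]


# Parsing, translation and optimization: the engine turns the SQL string
# into an internal plan. Without an index it has to scan the table.
plan_without_index = plan_details(conn, query)

# Adding an index gives the optimizer another access path to choose from.
conn.execute("CREATE INDEX idx_balance_salary ON balance (salary)")
plan_with_index = plan_details(conn, query)

# Evaluation: the selected plan is executed and the rows are returned.
result = sorted(row[0] for row in conn.execute(query))

print(plan_without_index)
print(plan_with_index)
print(result)
```

The exact wording of the plan text differs between SQLite versions, so the sketch prints it rather than relying on a fixed string.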
Query processing is an important concern in the area of distributed databases.
The task is to determine the sequence and the sites for executing the set of
operations of a query such that the operating cost (communication cost and
processing cost) of processing the query is reduced. Query processing depends
not only on the operations of the query but also on the parameter values
associated with the query. Distributed query processing has an important
impact on the performance of a distributed database system 3.
Query Optimization is responsible for returning the result of execution in a
timely manner by using an effective plan. It finds a plan that decreases the
overall execution cost of a query; the process of choosing the lowest-cost
mechanism is known as Cost-Based Optimization, and there are two other
strategies to reduce the execution cost of a query, which are 5:
Query Optimization also has three principles, which are 6:
It characterizes the transformations, the target language and the source
language, and how to build a target-language expression from the original
query; the target language reflects the run-time aspects when the QEP
Example: “Physical representation of hash tables, an
index which determines the usage of varies varieties of access operators.
Operators implementing various join methods and index the QEP usage” (Dr. K.
Kiran Kumar, T.M. Santhi Sri, Voruganti Vamshipriya, 2015).
The query submitted by the user is evaluated against several candidate QEPs,
which are built as options from which an appropriate candidate is chosen.
This is used to compare the various QEPs and to find the best one that brings
an accurate result.
Example (Dr. K. Kiran Kumar, T.M. Santhi Sri, Voruganti Vamshipriya,
where salary < 3000
This is translated into the following relational algebra expressions:
$\sigma_{salary<3000}(\pi_{salary}(balance))$
$\pi_{salary}(\sigma_{salary<3000}(balance))$
It is also represented by the following trees:
σ salary<3000          π salary
      |                    |
  π salary          σ salary<3000
      |                    |
   balance              balance
Query optimization is a difficult task in a distributed database, as data location becomes a main factor. In order to optimize queries carefully, adequate information should be available to determine which data access techniques are most effective, for instance: table and column cardinality, organization information, and index availability. Optimization algorithms have an important effect on the performance of distributed query processing 3.
2. Literature Review
In query processing, users of the database mostly specify what data they want rather than the procedure for retrieving it; therefore, the most important part of query processing is query optimization, which is responsible for finding the best way to perform queries in the database 2. Additionally, both query processing and optimization have a significant impact on the performance of the distributed database (DDB), and studies have explored many methods that optimize queries and improve the execution of the distributed database. One study found that the join query can be optimized in a distributed database by comparing two methods. The first method transmits data from the server to the client, inserts the data into the client DB, and then executes the join query. The second method executes the join query on the client immediately after the data arrive from the server site, without inserting the data into the client DB, so the time to insert the data into the client DB is saved. Consequently, this method optimizes the join query in the DDB (Pawandeep Kaur, 2013) 1.
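The two join methods compared in that study can be sketched as follows. This is a hedged illustration, not the study's implementation: the two servers and the client are simulated with in-memory SQLite databases, the rows are hypothetical, and the in-memory hash join used for the second method is an assumption about how a join can run without inserting rows into the client DB. The SUPPLY and SUPPLIER schemas are the ones the paper uses in its later example.

```python
import sqlite3

SUPPLY_DDL = "CREATE TABLE SUPPLY (SUPPLY_NO INTEGER, FROM_PLACE TEXT, TO_PLACE TEXT)"
SUPPLIER_DDL = "CREATE TABLE SUPPLIER (SUPPLY_NO INTEGER, S_NAME TEXT, S_ADDRESS TEXT)"


def make_server(ddl, table, rows):
    """Create an in-memory database that stands in for one remote server."""
    db = sqlite3.connect(":memory:")
    db.execute(ddl)
    db.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)
    return db


# Hypothetical rows for the two relations.
server1 = make_server(SUPPLY_DDL, "SUPPLY",
                      [(1, "Oslo", "Rome"), (2, "Lima", "Cairo"), (3, "Pune", "Kyoto")])
server2 = make_server(SUPPLIER_DDL, "SUPPLIER",
                      [(1, "Acme", "1 Main St"), (3, "Globex", "9 Side Rd")])


def fetch(server, table):
    """Stand-in for transmitting a whole relation from a server to the client."""
    return server.execute(f"SELECT * FROM {table}").fetchall()


def join_with_insertion(supply_rows, supplier_rows):
    """Method 1: insert the received rows into the client DB, then join in SQL."""
    client = sqlite3.connect(":memory:")
    client.execute(SUPPLY_DDL)
    client.execute(SUPPLIER_DDL)
    client.executemany("INSERT INTO SUPPLY VALUES (?, ?, ?)", supply_rows)
    client.executemany("INSERT INTO SUPPLIER VALUES (?, ?, ?)", supplier_rows)
    return client.execute(
        "SELECT * FROM SUPPLY s, SUPPLIER sr WHERE s.SUPPLY_NO = sr.SUPPLY_NO"
    ).fetchall()


def join_without_insertion(supply_rows, supplier_rows):
    """Method 2: join the received rows directly and skip the insertion step."""
    suppliers_by_no = {}
    for row in supplier_rows:
        suppliers_by_no.setdefault(row[0], []).append(row)
    return [s + sr for s in supply_rows for sr in suppliers_by_no.get(s[0], [])]


supply_rows = fetch(server1, "SUPPLY")
supplier_rows = fetch(server2, "SUPPLIER")
with_insert = join_with_insertion(supply_rows, supplier_rows)
without_insert = join_without_insertion(supply_rows, supplier_rows)
print(sorted(with_insert) == sorted(without_insert))  # True
```

Both functions return the same joined rows; the second simply never pays the cost of writing the received data into client tables, which is the saving the study attributes to its second method.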
Another study (Monjurul Alom, Frans Henskens and Michael Hannaford, 2009) 3 explained, on the basis of join and semi-join strategies, different methods such as the Fragmentation and Replication Strategy (FRS) and the Partition and Replicate Strategy (PRS) for processing a query when all the relations referenced by the query are not fragmented but are distributed over various sites. This technique is used to determine which relation is to be segmented into fragments and where the fragments are sent for processing. Furthermore, the method processes the query by exploiting parallelism and by decreasing the quantity of data transported between sites; it also gives a better query processing cost when a specific query references one relation or all relations at the various sites that hold the attributes of the query. The researchers in this study were mostly concerned with how to "fragment more than one referenced non fragmented relations as FRS is not applicable to processing distributed queries in which all of the relations which are non-fragmented but referenced by a query" (Monjurul Alom, Frans Henskens and Michael Hannaford, 2009) 3. They explain these strategies on the basis of six definitions (D1-D6), and they also describe distributed query optimization issues.
Moreover, (Abhijeet Raipurkar and G. R. Bamnote, 2013) 9 used two methods (a semi-join based query optimization algorithm and the SDD-1 algorithm) to improve the performance of the distributed database. Those optimization algorithms play an effective role in the performance of distributed query processing, including reducing the response time of the query and the cost of the communication process (Abhijeet Raipurkar, G. R. Bamnote, 2013) 9.
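The reason a semi-join can reduce communication cost can be shown with a small sketch. The relations `orders` (at a hypothetical site A) and `customers` (at a hypothetical site B) are invented for illustration, and the cost is counted simply as the number of attribute values shipped, which is an assumed cost measure and not the one used in the cited studies.

```python
def join(left, right, attribute):
    """Equi-join of two relations (lists of dicts) on a shared attribute."""
    return [{**l, **r} for l in left for r in right if l[attribute] == r[attribute]]


def semi_join(relation, keys, attribute):
    """Keep only the rows whose join attribute occurs in the shipped key set."""
    return [row for row in relation if row[attribute] in keys]


# Hypothetical data: site A holds orders, site B holds customers.
orders = [
    {"cust_id": 1, "item": "disk"},
    {"cust_id": 1, "item": "cable"},
    {"cust_id": 4, "item": "fan"},
]
customers = [{"cust_id": i, "name": f"customer-{i}", "city": "n/a"}
             for i in range(1, 9)]

# Strategy 1: ship the whole customers relation from site B to site A.
full_cost = sum(len(row) for row in customers)

# Strategy 2: ship only the join keys from A to B, reduce customers at B,
# then ship the reduced relation back to A.
keys = {row["cust_id"] for row in orders}
reduced_customers = semi_join(customers, keys, "cust_id")
semi_join_cost = len(keys) + sum(len(row) for row in reduced_customers)

# Both strategies give the same join result at site A.
assert join(orders, customers, "cust_id") == join(orders, reduced_customers, "cust_id")
print(full_cost, semi_join_cost)  # 24 8
```

The saving depends on how selective the join is: when most customers match an order, shipping the keys first adds cost instead of removing it.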
The impact of query processing and optimization on the distributed database has been discussed for many years. An article published in 1979 in IEEE Transactions on Software Engineering, and uploaded by Alan Hevner in (2015) 4, explained Algorithm G for query processing, which is an integral part of a distributed database management system. This algorithm derives a strategy for a distributed query in a two-step process:
· Algorithm PARALLEL for response time.
· Ordered SERIAL strategy for total time.
These steps provide distribution strategies with minimal response time and minimal total time for queries 4. All of the studies explained some algorithms that improve the performance of the DDB; however, none of them showed which algorithm is the best among them: which one has the minimal response time and the minimal transmission cost?
3. Query Algorithms
Queries are eventually reduced to a number of data scans and operations on the underlying physical record structures. For each relational operation there can exist several different access paths to the specific records required. The query execution engine can have a large number of specialized methods, each intended to process a specific combination of relational operation and access path. There are two types of algorithms, as follows 5 7:
3.1 Selection Algorithms
The Select operation must search through the data files for records meeting the selection criteria.
3.2 Join Algorithms
Like selection, the join operation can be executed in a variety of ways. In terms of disk accesses, join operations can be very costly, so implementing and using efficient join algorithms is critical in minimizing a query's execution time.
4.
Some Query Processing and Optimization Methods in DDB
There are various methods to process and optimize queries in a database. These methods improve the performance of the query and also decrease its cost; the optimizer defines the order in which the operations of a query (such as joins, selects, and projects) should be executed 1. These methods are responsible for returning data in minimal time with minimal transmission cost. The primary operations used to extract the wanted information from tables (one table, two tables or multiple tables) are the join and semi-join methods. There are different measurements for assessing the performance of join and semi-join in a distributed database system (DDBS), such as query cost, memory used, CPU cost, input/output cost, sort operations, data transmission, total time and response time. The join is the most essential operation in a database for bringing together data from two or more tables 2. There are some algorithms that improve the performance of the distributed database:
4.1 First Algorithm: The parallel query processing method, Join Query Optimization 1
This algorithm focuses on maximizing the number of simultaneous transmissions rather than minimizing the amount of data transmitted.
Fig. 3 The parallel processing of join query 1
Example (taken from 1): The client requests data from server 1 and server 2 through queries. Server 1 then sends the SUPPLY data and server 2 sends the SUPPLIER data to the client. The client inserts the data into its database and performs the join query on the data from the two servers. Suppose server 1 contains the SUPPLY relation
SUPPLY(SUPPLY_NO, FROM_PLACE, TO_PLACE)
and server 2 contains the SUPPLIER relation
SUPPLIER(SUPPLY_NO, S_NAME, S_ADDRESS)
and the client wants the join of the SUPPLY and SUPPLIER relations from server 1 and server 2 respectively, and wants to perform the query Q.
Q: SELECT * FROM SUPPLY s, SUPPLIER sr WHERE s.SUPPLY_NO = sr.SUPPLY_NO
In distributed databases, query Q can be divided into three parts:
1. SELECT * FROM SUPPLY
2. SELECT * FROM SUPPLIER
3. SELECT * FROM SUPPLY s, SUPPLIER sr WHERE s.SUPPLY_NO = sr.SUPPLY_NO
Queries 1 and 2 select the data from the two source tables. Because these data reside on the remote machines, the execution of these two queries does not require data transmission. Query 3 is the join query, which cannot be executed until the data at the remote sites have been transferred to the same site. Several factors must be considered when optimizing the execution of a distributed join query that accesses data from distant sites:
· Size of transmitted data: the amount of data that has to be transmitted; it should be small so that transmission takes less time.
· Transmission speed: this factor relies on the network speed.
· Local processing costs: these include CPU cost and I/O cost and can differ with the processing speed of the machine; they should be small, and the operations should be executed efficiently, in order to raise the performance of the join query.
The parallel processing of a join query in a distributed database depends on some factors:
· Time for transmitting data: if the quantity of transmitted data increases, the transmission time also increases; it depends on the network speed between source and destination.
· Time for inserting data: two different operations are used to measure the time taken to add data from the server into the client database:
1. Row-by-row insertion.
2. Bulk insertion.
· Time for join execution: the execution time, on the client side, of the join query that joins attributes from server A and server B.
The optimization of the join query can be performed by using:
1. Various join orders.
2. Alternative "where" clauses that give the same result.
3. Various join methods.
4.
Local processing cost of query such as CPU cost and I/O cost.
In order to increase the performance of the join query, the cost of transmission should be low; the transmission cost and the insertion cost matter most when a machine wants the result of a join over data from different machines. Moreover, inserting the servers' data into the client database needs more time than transmitting the data. Hence, the important factor in improving the performance of the join query is to bring in data from the different sources using insertion functions that take less time. Consequently, in this method the insertion time of data into the client DB is cut, and the join query is optimized in distributed databases 1.
4.2 Second Algorithm: SDD-1 2
This algorithm uses the semi-join operation to handle the connections between the relations and to reduce them. The SDD-1 algorithm has three important advantages, as follows:
1. It uses the semi-join operation as its processing strategy.
2. The relations at all sites are neither replicated nor fragmented.
3. When the cost of the whole algorithm is estimated, the transmission cost to the starting site is not calculated.
The SDD-1 algorithm is formed of two parts:
1. The basic algorithm.
2. The post-optimization.
The SDD-1 algorithm does not make full use of the characteristics of a distributed database system: all of the semi-join operations run in sequence, which raises the response time of the query to a certain extent.
4.3 Third Algorithm: Algorithm G 4
The characteristic of this algorithm that makes query optimization efficient is that data transmissions are quickly eliminated from consideration if they cannot be part of a minimal-time schedule, so the number of transmissions considered is smaller than the number of all possible data transmissions. The algorithm can be run with either the minimization of response time or of total time as its cost objective. Algorithm G has five steps, as follows:
1.
(Initialization.) After initial local processing, order the relations so that $s_1 \le s_2 \le \dots \le s_m$. For each joining domain $d_{ij}$ of each relation $R_i$.
2. Repeat Step 3) for each $R_i$, $i = 1, \dots, m$, then GO TO Step 4).
3. Build candidate schedules for $R_i$.
4. Integrate schedules.
5. Build strategy.
In terms of response time, all possible parallel data transmissions for a common joining domain are included and contribute to the selectivity of the transmission under consideration. For total time, these parallel transmissions are not included. Considering parallel transmissions to minimize the response time of a schedule raises the complexity of Algorithm G by a considerable amount, while the decrease in schedule response time is limited. This algorithm also derives a strategy for a distributed query in a two-step process:
· Algorithm PARALLEL for response time.
· Ordered SERIAL strategy for total time.
Consequently, the analysis of Algorithm G is carried out only for minimizing total time.
Conclusion
One of the most critical functional requirements of a database system is its capacity to process queries in a timely manner; this is the responsibility of query processing and optimization. Query processing and optimization in a distributed system require the transmission of information between computers in a network. The arrangement of data transmissions and local data processing is known as a distribution strategy for a query. Two cost measures, response time and total time, are used to judge the quality of a distributed database, and there are many algorithms that are used to measure the performance of the DDB and to show the impact of query processing and optimization on the distributed database.
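The difference between the two cost measures named in the conclusion can be made concrete with a short sketch. It rests on assumptions that the paper does not state: each transmission is charged a linear cost (a fixed start-up cost plus a cost per unit of data), total time is taken as the sum of all transmission costs, and response time is taken as the finish time of the last transmission when independent transmissions run in parallel. The strategy, its transmission names, and its sizes are hypothetical.

```python
from functools import lru_cache


def transmission_cost(size, startup=10.0, per_unit=1.0):
    """Assumed linear cost of shipping `size` units of data between two sites."""
    return startup + per_unit * size


# Hypothetical distribution strategy: name -> (data size, prerequisite transmissions).
# T1 and T2 can run in parallel; T3 has to wait until both have arrived.
strategy = {
    "T1": (100, ()),
    "T2": (200, ()),
    "T3": (50, ("T1", "T2")),
}


def total_time(strategy):
    """Total time: the sum of the costs of all transmissions."""
    return sum(transmission_cost(size) for size, _ in strategy.values())


def response_time(strategy):
    """Response time: finish time of the last transmission, with parallelism."""

    @lru_cache(maxsize=None)
    def finish(name):
        size, prerequisites = strategy[name]
        start = max((finish(p) for p in prerequisites), default=0.0)
        return start + transmission_cost(size)

    return max(finish(name) for name in strategy)


print(total_time(strategy))     # 380.0
print(response_time(strategy))  # 270.0
```

In this example the parallel transmissions T1 and T2 overlap, so the response time (270) is lower than the total time (380); a strategy that minimizes one measure does not necessarily minimize the other.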